GeStore : incremental computation for metagenomic pipelines

Genomics is the study of the genomes of organisms. Metagenomics is the study of environmental genomic samples. For both genomics and metagenomics DNA sequencing, and the analysis of these sequences, is an important tool. This analysis is done through integration of sequence data with existing meta-d...

Bibliographic Details
Main Author: Pedersen, Edvard
Format: Master Thesis
Language: English
Published: Universitetet i Tromsø 2012
Online Access: https://hdl.handle.net/10037/4272
id ftunivtroemsoe:oai:munin.uit.no:10037/4272
record_format openpolar
institution Open Polar
collection University of Tromsø: Munin Open Research Archive
op_collection_id ftunivtroemsoe
language English
topic VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426
VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og -arbeid: 426
INF-3990
description Genomics is the study of the genomes of organisms, and involves cultivating organisms in a lab and analyzing them. Metagenomics is the study of genomic samples collected directly from the environment, allowing researchers to study organisms that are difficult to cultivate in a petri dish. DNA sequencing and the analysis of the resulting sequences are important tools for both fields. This analysis is done by integrating sequence data with existing meta-data collections, which is particularly interesting for metagenomics, as a single biological sample can contain thousands of different organisms. Recent developments in DNA sequencing technology mean that the volume of data that can be produced per dollar is increasing faster than the volume of data that can be analyzed and stored per dollar. This data growth makes the initial analysis of these massive data sets increasingly expensive. In addition, there is a need to periodically update old results using new meta-data from the many knowledge bases (meta-data collections) for biological data. Today, this typically requires rerunning the experimental analysis. Incremental analysis, which avoids a full rerun, is interesting for metagenomics since environmental samples potentially contain thousands of organisms.
In metagenomic analysis, different sets of tools are used depending on the type of information required. These tools are generally arranged in a pipeline, where the output files of one tool act as the input for the next. The analysis done by some steps depends on different meta-data collections. When meta-data is updated, these steps and all subsequent steps typically need to be executed again. Incremental updates can save significant computation time by running these pipelines against the updated segments, rather than the full meta-data collections.
We believe that systems for incremental updates for metagenomic analysis pipelines have the following requirements: (i) reduce the computational resource requirements by using incremental update techniques; (ii) make the meta-data collections accessible without the use of proprietary or computationally expensive techniques; (iii) perform the incremental updates on demand, since experiments have different needs, by handling meta-data updates and generating arbitrary delta meta-data collections; (iv) support most genomic analysis tools and run on most job management systems; (v) require no changes to the tools that make up the pipeline, since modifying the many available tools is impractical; (vi) require only minimal changes to the job management and resource allocation system, to save implementation time for the pipeline system maintainer; and (vii) maintain a view of previous meta-data collections, so that old experiments can be repeated with the correct meta-data collection version. To our knowledge, no existing incremental update system satisfies all seven requirements. Often they do not support on-demand processing or maintaining views of old data; in addition, many systems require computations to be done within a specific framework or programming language.
In this thesis we describe the GeStore incremental update system, which satisfies all seven requirements. GeStore reduces the size of the meta-data collections, and thus the computational requirements for the pipeline, by leveraging incremental update techniques, satisfying requirements (i) and (iii). In addition, it reduces the storage requirements of the meta-data collections, while still maintaining a complete view of the meta-data collection in a plain-text format, fulfilling requirements (ii) and (vii). It also presents a simple interface to the application programmer, so that integrating the system with existing pipeline solutions does not require large changes to the pipeline system or tools, in accordance with requirements (iv), (v) and (vi). GeStore has been implemented using the MapReduce framework, along with HBase, to provide scalable meta-data processing. We demonstrate the system by generating subsets of meta-data collections for use by the widely used genomic tool BLAST. In our evaluation, we have integrated GeStore with an existing pipelining system, GePan, a metagenomic pipeline system developed for a local biotech company in Tromsø, Norway, and used real-world data to evaluate the performance and benefits of GeStore. Our experimental results show that GeStore is able to reduce the runtime of the incremental updates by up to 65% when compared to unmodified GePan, while introducing a low storage overhead and requiring minimal changes to GePan. We believe that efficient on-demand updates of metagenomic data, as provided by GeStore, will be useful to our biology collaborators.
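The ideas of a delta meta-data collection (requirement iii) and a view of a previous collection version (requirement vii) can be illustrated with a small sketch. This is not the thesis implementation, which is built on MapReduce and HBase; it is an in-memory Python toy, and the class name `VersionedCollection`, its methods, and the sample entries are all hypothetical. It assumes entries are stored in increasing timestamp order and, for brevity, does not model deletions.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedCollection:
    """Toy stand-in for a versioned meta-data collection (hypothetical API)."""

    # entry id -> list of (timestamp, value) pairs, appended in timestamp order
    history: dict = field(default_factory=dict)

    def put(self, entry_id, timestamp, value):
        """Record a new version of an entry, skipping unchanged values."""
        versions = self.history.setdefault(entry_id, [])
        if not versions or versions[-1][1] != value:
            versions.append((timestamp, value))

    def view_at(self, timestamp):
        """Return the whole collection as it looked at `timestamp`."""
        view = {}
        for entry_id, versions in self.history.items():
            older = [v for t, v in versions if t <= timestamp]
            if older:
                view[entry_id] = older[-1]
        return view

    def delta(self, start, end):
        """Return entries added or changed in the interval (start, end]."""
        old, new = self.view_at(start), self.view_at(end)
        return {k: v for k, v in new.items() if old.get(k) != v}


# Hypothetical sample data: sequence entries at two collection releases.
store = VersionedCollection()
store.put("P1", 1, "MKV")
store.put("P2", 1, "MAA")
store.put("P1", 2, "MKV")  # unchanged in release 2, so no new version is stored
store.put("P2", 2, "MAT")  # changed in release 2
store.put("P3", 2, "MLL")  # new in release 2

old_view = store.view_at(1)  # what an experiment run at release 1 saw
update = store.delta(1, 2)   # only the entries a rerun needs to process
```

In this sketch a pipeline step such as a BLAST search would be rerun against `update` (two entries) instead of the full release-2 collection (three entries), while `old_view` allows the original experiment to be repeated against the release it actually used.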
format Master Thesis
author Pedersen, Edvard
title GeStore : incremental computation for metagenomic pipelines
publisher Universitetet i Tromsø
publishDate 2012
url https://hdl.handle.net/10037/4272
geographic Norway
Tromsø
genre Tromsø
op_relation https://hdl.handle.net/10037/4272
URN:NBN:no-uit_munin_3987
op_rights openAccess
Copyright 2012 The Author(s)
_version_ 1766220927890096128
spelling ftunivtroemsoe:oai:munin.uit.no:10037/4272 2023-05-15T18:35:28+02:00 GeStore : incremental computation for metagenomic pipelines Pedersen, Edvard 2012-06 https://hdl.handle.net/10037/4272 eng eng Universitetet i Tromsø University of Tromsø https://hdl.handle.net/10037/4272 URN:NBN:no-uit_munin_3987 openAccess Copyright 2012 The Author(s) VDP::Mathematics and natural science: 400::Information and communication science: 420::System development and system design: 426 VDP::Matematikk og Naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Systemutvikling og -arbeid: 426 INF-3990 Master thesis Mastergradsoppgave 2012 ftunivtroemsoe 2021-06-25T17:53:20Z Master Thesis Tromsø University of Tromsø: Munin Open Research Archive Norway Tromsø