Decoding Phenotypes via Transcriptomics and Proteomics: Cancer and beyond

Bibliographic Details
Main Author: Lin, Miin Sophia
Other Authors: Bafna, Vineet
Format: Software
Language: English
Published: eScholarship, University of California 2022
Online Access: https://escholarship.org/uc/item/1s98z674
Description
Summary: While genomics approaches are important in studying host phenotype alterations in response to environmental changes or disease, proteomics approaches offer a complementary perspective by providing a direct readout of expressed functional pathways. Proteogenomic strategies utilizing RNA-sequencing data to construct splice graph databases have been used in a variety of applications to identify novel splice junctions and mutated peptides. The work in this dissertation begins with the integration of splice databases into a proteogenomic pipeline for the validation of the recently released annotation of the Atlantic salmon genome, and the validation of primary hepatocytes as in vitro models for salmon toxicity studies. Searching in-house-generated LC-MS/MS datasets against splice databases constructed from publicly available and in-house-generated salmon transcriptomics data, our proteogenomic pipeline identified 183 events in support of 71 transcript predictions. These included novel genes, corrections to current annotations, and support for Ensembl transcripts. In addition to host-expressed proteins, microbial-expressed proteins can also alter host phenotype. In the absence of prior taxonomic information, tandem mass spectra must be searched against large pan-microbial databases, which imposes a heavy computational workload and reduces sensitivity. Using both software and algorithmic methods, we developed ProteoStorm, an efficient database search framework for large-scale metaproteomics studies, which reduced runtime from 22 weeks to 9.7 hours while retaining 96% of peptide identifications when compared to MSGF+. A reanalysis of a urinary tract infection dataset revealed a complex pattern of polymicrobial expression, including previously identified microbes. In the final chapter, we used transcriptomics data from TCGA to identify a set of genes that may be involved in the maintenance of ecDNA amplicons in cancer. Specifically, we applied the Boruta algorithm, which incorporates the Random Forest classifier ...