XFams

Bibliographic Details
Main Author: Zayed, Ahmed
Format: Dataset
Language: unknown
Published: CyVerse Data Commons 2020
Subjects:
ren
Online Access: https://dx.doi.org/10.25739/s1k8-ns03
https://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/Xfams/Zayed_XfamsXC_Jul2020
Description
Summary: The new database includes viral sequences from two metagenomic datasets. The first is the Global Ocean Viromes 2.0 (GOV 2.0), which was derived from 145 samples distributed across the world’s oceans, covering different depth layers and spanning from pole to pole (Gregory et al., 2019). The second dataset, the Stordalen Mire Viruses (SMV) dataset, consists of viral contigs recovered from 226 bulk metagenomes and 7 viromes, sampling the palsa, bog, and fen wetlands at the Stordalen Mire research site in northern Sweden (Emerson et al., 2018).

A total of 850,446 viral contigs were identified by VirSorter 1.0 (Roux et al., 2015a), DeepVirFinder (Ren et al., 2018), and MARVEL (Amgarten et al., 2018). Only the 33,137 contigs in the intersection of the three tools (Category 1 in VirSorter, a score of >=90% in MARVEL, and a score of >=0.9 with a p-value of <0.05 in DeepVirFinder) were kept in our dataset. DeepVirFinder and MARVEL were run with default parameters, while VirSorter was run in virome decontamination mode for both datasets.

Open reading frames (ORFs) were predicted using Prodigal in metagenomic mode (Hyatt et al., 2010). The resulting protein sequences were filtered by removing those with >=95% similarity to RefSeq bacterial and archaeal proteins, and then clustered with ClusterONE (Nepusz et al., 2012) to remove singletons, using default parameters except “minimum density (-d) = 0.3, minimum number of sequences in a cluster (-s) = 0.2, and the maximum overlap between two clusters (--max-overlap) = 0.8”. Sequences within each cluster were then aligned using MUSCLE (Edgar, 2004), with the maximum number of iterations (-maxiters) set to 4. For each of these multiple sequence alignments, a profile hidden Markov model was built with hmmbuild and pressed with hmmpress, both included in the HMMER3 package (https://github.com/EddyRivasLab/hmmer).
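The three-tool intersection filter described in this summary can be illustrated with a minimal Python sketch. The record layout (`ContigCalls`), the function names, and the representation of the MARVEL score as a fraction between 0 and 1 are assumptions made for illustration; the dataset itself does not specify how the per-tool results were stored or combined in code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContigCalls:
    """Per-contig results from the three tools (hypothetical layout)."""
    contig_id: str
    virsorter_category: Optional[int]  # None if VirSorter made no call
    marvel_score: Optional[float]      # assumed to be a fraction in [0, 1]
    dvf_score: Optional[float]         # DeepVirFinder score
    dvf_pvalue: Optional[float]        # DeepVirFinder p-value


def passes_all_three(c: ContigCalls) -> bool:
    """Apply the thresholds quoted in the summary to a single contig."""
    return (
        c.virsorter_category == 1
        and c.marvel_score is not None and c.marvel_score >= 0.90
        and c.dvf_score is not None and c.dvf_score >= 0.9
        and c.dvf_pvalue is not None and c.dvf_pvalue < 0.05
    )


def high_confidence_contigs(calls: List[ContigCalls]) -> List[str]:
    """Return IDs of contigs supported by all three tools, in input order."""
    return [c.contig_id for c in calls if passes_all_three(c)]
```

A contig missing from any one tool's output (represented here as `None`) is excluded, which matches the idea of keeping only the intersection.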
Each protein within a cluster was compared to KEGG, UniRef90, and InterProScan (Kanehisa et al., 2002; Suzek et al., 2015; Mitchell et al., 2018) using USEARCH (Edgar, 2010), and the reciprocal best BLAST hits with a bit score greater than 60 were saved and ranked, as described in Daly et al. (2016).
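The reciprocal-best-hit step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: hits are assumed to be available as `(query, target, bitscore)` tuples for both search directions, the function names are hypothetical, and sorting by descending bit score is only a stand-in for the ranking scheme of Daly et al. (2016), which is not detailed in this record.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query, target, bitscore); assumed layout


def best_hits(hits: Iterable[Hit], min_bits: float = 60.0) -> Dict[str, Tuple[str, float]]:
    """Keep, for each query, its highest-scoring hit with bitscore > min_bits."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, bits in hits:
        if bits > min_bits and (query not in best or bits > best[query][1]):
            best[query] = (target, bits)
    return best


def reciprocal_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                         min_bits: float = 60.0) -> List[Hit]:
    """Return pairs where each sequence is the other's best hit.

    `forward` holds protein-vs-database hits and `reverse` holds
    database-vs-protein hits. The result is sorted by forward bit score,
    highest first (an assumed ranking, used here for illustration only).
    """
    fwd = best_hits(forward, min_bits)
    rev = best_hits(reverse, min_bits)
    pairs = [
        (query, target, bits)
        for query, (target, bits) in fwd.items()
        if target in rev and rev[target][0] == query
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

The threshold is applied with a strict comparison, following the "greater than 60" wording of the summary.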