Description
Summary: Metadata record for data from ASAC Project 2899. See the link below for public details on this project.

We conducted a genomic analysis of Archaea and Bacteria collected from lakes in the Vestfold Hills, Antarctica. This provided a new level of understanding about the life forms inhabiting these cold lakes. Linked to the meteorological, geological, chemical and physical data that have been collected over years of previous research, the new genomic data will generate a complete understanding of how the microorganisms have evolved and how they have transformed, and presently interact with, the Antarctic environment. Deriving an integrated understanding of microbial ecology is essential for determining ways of preserving the health of the world's ecosystems.

The data are available for download as an Excel spreadsheet and a Word document from the URL given below.

The GPS coordinates where samples were collected are as follows (note that these are UTM (Universal Transverse Mercator) coordinates, from zone 44D):

Ace Lake: 44D 0384881 (easting), 2401821 (northing)
Deep Lake: 44D 0385351, 2391772
Organic Lake: 44D 0384928, 2403550

The fields in this dataset are:

Water temperature - degrees Celsius
Specific conductivity - microsiemens per centimetre
Conductivity - microsiemens per centimetre
Salinity - parts per trillion
Dissolved oxygen % - %
Dissolved oxygen concentration - milligrams per litre
Dissolved oxygen charge - This is an engineering value. The value is unitless; the recommended reading is 50 plus or minus 25. A low reading generally means the membrane needs to be replaced, and a high reading means the probe needs to be reconditioned.
PressureA (this is a depth reading of the Sonde) - pounds-force per square inch absolute
Water depth - metres
pH
pHmV (this is the pH millivolt reading that the probe is outputting to the Sonde) - millivolts
Turbidity - nephelometric turbidity units
BP (barometric air pressure) - psi (pounds per square inch)

Taken from the 2008-2009 Progress Report:

Progress against objectives:

New lake and ocean samples, including additional opportunistic samples from Heard Island, were obtained Oct-Dec 2008. All samples from 2006 forward are being processed. This includes DNA (metagenomics) and protein (proteomics). A great deal of bioinformatic analysis has been performed on the metagenome data. Metaproteomics has also proceeded well. Details of some of the progress are as follows:

In the reporting period, 1,064,488 Sanger sequencing reads were produced, with 967,410 passing quality control, which at an average of 700bp provided 677Mb of sequence data. The reads were produced in batches for each sample. We generated assembly statistics and phylogenetic profiles after the completion of each batch. Sample diversity then guided the sequence allocation for each sample.

A number of pragmatic software tools have been created to perform the analyses. As an example, for one sample the whole-sample assembly was characterised by read depth, GC content, di-nucleotide frequency (Tetra) and tri-nucleotide frequency (Tetra) on a per-scaffold basis. These intrinsic properties then formed vectors in a feature space on which a self-organising map clustering analysis was performed. The cluster comprising the most abundant species was isolated and its genes annotated. This represented 9 contigs with a total of 1.7Mb and 1683 predicted genes. For this sample, proteins were extracted and metaproteomics performed, resulting in a total of 3970 confidently matched peptides providing identities for 504 proteins (at least 2 peptide matches per protein), representing about 30% coverage.
In comparison, a total of 170 proteins were identified against the non-redundant database. In other metaproteomic analyses, samples from 4 lake depths provided a total of 7,925 peptides, allowing the identification of 1015 proteins against the NCBI non-redundant protein database (matches to the annotated metagenome data have not yet been performed).

For testing detection limits and accuracy of identifications using a metaproteomics approach, a simulated mixed-community study was performed using S. alaskensis and E. coli. This has shown that cell numbers, protein abundance and cell volumes all affect the ability to detect proteins of individual microorganisms within a population. The type and size of the database that the metaproteomic dataset is searched against (non-redundant versus the S. alaskensis + E. coli protein database) also resulted in differences in protein detection. The work has been useful for optimising parameters used for metaproteomics of the Antarctic samples.

An interesting eukaryotic virus that dominates the biomass of one of the samples is being analysed, with the present work focusing on classifying and characterising it. Transmission electron microscopy of the water sample revealed virus-like particles of approximately 150nm, but it was unclear from morphology whether they represented a single virus type or several. Two complementary metagenomic assembly approaches are being used to produce the most complete assembly possible of the large viral sequences.

The first assembly strategy follows a conventional metagenomic workflow consisting of assembly of the whole metagenomic dataset followed by taxonomic binning of the constructs. An initial assembly has been constructed after determining the optimum acceptable degree of error. A high degree of assembly was evident, with the largest scaffold spanning 108kb with 6X coverage.
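The figures quoted in this report can be cross-checked with simple arithmetic. The sketch below recomputes the sequence yield from the read count and average read length, the share of predicted genes with a protein identification (which is consistent with the "about 30% coverage" quoted), and the approximate number of bases behind a 108kb scaffold at 6X coverage. The function names are illustrative only, and applying the 700bp average read length to the viral scaffold is an assumption made here, not a figure from the report.

```python
def total_bases(n_reads, mean_read_len_bp):
    """Total sequence produced by n_reads of a given mean length (in bp)."""
    return n_reads * mean_read_len_bp


def fraction_identified(n_proteins, n_predicted_genes):
    """Share of predicted genes for which a protein was identified."""
    return n_proteins / n_predicted_genes


def bases_for_coverage(length_bp, coverage):
    """Bases of read data needed to cover length_bp at the given depth."""
    return length_bp * coverage


if __name__ == "__main__":
    # Reads passing quality control, at the reported 700bp average.
    print(f"Sequence yield: {total_bases(967_410, 700) / 1e6:.0f} Mb")

    # 504 identified proteins out of 1683 predicted genes.
    print(f"Genes with a protein identification: "
          f"{fraction_identified(504, 1683):.0%}")

    # Largest viral scaffold: 108kb at 6X coverage.
    scaffold_bases = bases_for_coverage(108_000, 6)
    # ASSUMPTION: the 700bp average read length also applies to this sample.
    print(f"Approximate reads in largest scaffold: "
          f"{scaffold_bases / 700:.0f}")
```

Under these assumptions the largest scaffold corresponds to roughly 650kb of read data, which is a small fraction of the total sequence produced in the reporting period.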
A BLASTx search of the five largest contigs (greater than 10kb) produced two alignments to Major Capsid Protein (MCP) genes: one to the short MCP gene of Chrysochromulina ericina virus (28% identity) and the other to the full MCP gene of Phaeocystis pouchetii virus (76% identity). Sequence flanking the full MCP gene corresponds to conserved hypothetical protein sequences from Ostreococcus virus 5 (45% identity) and Paramecium sp. Chlorella virus AR158 (39% identity). These large, deeply assembling contigs will be used to 'tune' the parameters to improve assembly of the entire metagenome.

A preliminary attempt to bin the scaffolds from the initial assembly using tetranucleotide frequencies has not completely resolved them into clear taxonomic clusters. A multi-dimensional binning approach, combining sequence coverage, GC content and nucleotide frequencies with identification of marker genes, is being developed and will be applied once an optimum whole metagenomic assembly has been completed.

Although the presence of conserved genes is a promising sign of accurate assembly, validation of the scaffolds by comparison to sequenced virus genomes is uninformative, as viruses are poorly represented in the public databases and extremely diverse. Instead, a second assembly strategy is underway that will conservatively extract and compile the viral sequence. The reads assigned in an initial MEGAN analysis to the large dsDNA viral clade were used in a preliminary round of assembly. This first assembly will be used as a reference to recruit more overlapping fragments, which will be combined in another round of assembly, extending the construct from the high-confidence 'seeds'. Cycles of recruitment and assembly will continue until the assembly reaches an end point. This is a new method of assembly that could potentially be used to extract and produce confident assemblies of other species with no sequenced representatives.
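The recruit-and-assemble cycle described above can be pictured with a toy sketch. It is not the project's software: the function names are hypothetical, a read is "recruited" if it shares any exact k-mer with sequence already in the construct, and the assembly step itself is left out, so the "construct" is simply the growing set of recruited sequences. Reverse complements and sequencing errors are ignored. The loop stops when a cycle recruits nothing new, which stands in for the "end point" mentioned in the text.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_until_stable(seeds, reads, k=5, max_cycles=100):
    """Grow a set of sequences from high-confidence seeds.

    Each cycle recruits every unrecruited read that shares at least one
    k-mer with the sequences recruited so far. Returns the recruited set
    and the number of cycles that added new reads.
    """
    recruited = set(seeds)
    pool = set(reads) - recruited
    known = set()
    for seq in recruited:
        known |= kmers(seq, k)

    productive_cycles = 0
    for _ in range(max_cycles):
        new_reads = {r for r in pool if kmers(r, k) & known}
        if not new_reads:
            break  # end point: nothing further overlaps the construct
        recruited |= new_reads
        pool -= new_reads
        for r in new_reads:
            known |= kmers(r, k)
        productive_cycles += 1
    return recruited, productive_cycles


if __name__ == "__main__":
    seeds = ["ACGTACGTAA"]
    reads = ["CGTAAGGCTT",   # overlaps the seed
             "GGCTTCCATG",   # overlaps only the first recruited read
             "TTTTTTTTTT"]   # unrelated
    recruited, cycles = recruit_until_stable(seeds, reads)
    print(sorted(recruited), cycles)
```

In this example the second read is reachable only through the first, so it is picked up in a second cycle, which is the behaviour that lets the construct extend outward from the seeds.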
Comparison between this virus-specific assembly and the conventional metagenomic assembly will allow evaluation of the fidelity of both processes.
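The composition-based binning mentioned in this report relies on per-scaffold properties such as read depth, GC content and tetranucleotide frequencies. As a minimal sketch of how such a feature vector could be built, the code below computes GC content and a 256-element tetranucleotide frequency profile for one sequence. The function names are illustrative, not those of the project's tools, and simplifications are assumed: only the forward strand is counted, and windows containing ambiguous bases are skipped.

```python
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_freqs(seq):
    """Relative frequency of each tetranucleotide (overlapping windows)."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:  # skips windows with N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def feature_vector(seq, read_depth):
    """Concatenate read depth, GC content and tetranucleotide profile."""
    return [read_depth, gc_content(seq)] + tetranucleotide_freqs(seq)


if __name__ == "__main__":
    vec = feature_vector("ACGTACGTACGGCCTTAAGG", read_depth=6.0)
    print(len(vec), vec[:2])
```

Vectors of this kind, one per scaffold, are what a clustering method such as a self-organising map would take as input. In practice the features would usually be scaled first so that read depth does not dominate the distance measure.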