***Sequenced using HiSeq
Sequences were assessed for Illumina adapter sequences in the 3’-end of the read using cutadapt v1.7.1 (parameters: -a AGATCGGAAGAGC -e 0.08 –overlap=3) (Martin 2011). Sequences were then trimmed based on quality scores using Trimmomactic v0.33 with a sliding window of 10bp and an average quality score cutoff of 28 (Bolger 2014). All sequences trimmed below 75bp in length were discarded (parameters: SLIDINGWINDOW:10:28 MINLEN:75). Only read pairs for which both mate survived trimming were retained for assembly and read coverage analysis.
For each sample library, quality trimmed sequences were assembled using IDBA-UD v1.1.1 using the default parameters (Peng 2012). Contigs < 500bp in length were removed from consideration for further analysis.
Initial putative coding DNA sequences (CDSs) were determined for contigs generated from each sample using Prodigal v2.6.1 (parameters: -m -p meta -q) (Hyatt 2010). Prodigal was used to generate putative CDS for the sediment metagenomes used for functional comparison (see below). Prodigal-predicted CDS for the NP samples were compared to publicly available deep-sea metagenomes (see below). Each of the NP samples were processed through the IMG/M annotation pipeline (GOLD Analysis Project ID: Bottom water - Ga0071103; 1382A - Ga0071100; 1383C Deep - Ga0071101; 1383C Shallow - Ga0071102) (Markowitz 2013). IMG/M annotations were used for all searches involving metabolisms of interest and assignments of phylogenetic markers.
Additional metagenomes from deep-sea environments were accessed from IMG/M for comparison to the NP samples. Additional metagenomes were selected that utilized identical sequencing technologies (Illumina HiSeq2000). Assemblies and putative CDS were retrieved for samples from the Guaymas (Taxon object ID: 3300001683) and Abe, Lau Basin (Taxon object ID: 3300001681) hydrothermal vent plumes. Both sets of assemblies were generated using IDBA. Assemblies and putative CDS were also retrieved for samples of formation fluids from CORKs 1362A and 1362B positioned in the Juan de Fuca Ridge flank (Taxon object IDs: 3300002481 and 3300002532, respectively). Assemblies were generated using combination of SOAPdenovo (Luo 2012), Newbler (454 Life Sciences, Roche), and minimus2 (Sommer 2007). Assemblies from sediment sampled at 5 cmbsf from the south Pacific (unpublished) and at 75 cmbsf from the Arctic Mid-Ocean ridge (DDBJ/EMBL/Genbank Accession: LAZR00000000) (Spang 2015) were processed using Prodigal to generate putative CDS. The sample from the south Pacific was assembled using IDBA-UD. The sample from the Arctic Mid-Ocean ridge was assembled using SPAdes v3.0.0 (Bankevich 2012).
Putative CDS for the NP samples and the additional metagenomes were searched using HMMER3 v3.1b1 (Finn 2011) against the TIGRFAM v14 database (Haft 2003) (hmmsearch, parameters: -E 0.00001). From the hmmsearch results, putative CDS were assigned to TIGRFAM roles based on the best match. For each metagenome, the relative abundance of each TIGRFAM role was determined (no. of putative CDS assigned to a specific role ÷ total no. of putative CDS assigned to all TIGRFAM roles). The relative abundance of the 115 identified TIGRFAM roles for each sample was visualized using principal component analysis (PCA) to determined the relationship between samples. PCA was performed using the Python library sklearn v0.16.1. Values underwent dimensionality reduction, while being fit to the model.
Utilizing the IMG annotations, a database of genes representing the necessary and essential components of the major carbon fixation pathways searched using BLASTp v2.2.30+ (parameters: -evalue 0.001 -max_target_seqs 3) against the putative CDS for each NP sample (Altschul 1997). The database contained genes necessary to identify the reductive acetyl-CoA/Wood-Ljunghal pathway, the reverse citric acid (rTCA) cycle, the Calvin-Bensson-Bassham (CBB) cycle, the 3-hydroxypropionate/4-hydroxybutyrate (3-OH-propionate/4-OH-butryrate) cycle, and the 3-hydroxypropionate (3-OH-propionate) bicycle. Successful matches were limited to matches with > 30% amino acid identity (AAID) and > 30% protein alignment. Putative carbon fixation relevant CDS were used to recruit sequences from quality trimmed sequence libraries, which had been normalized by random sampling to the size of the smallest library (27,737,142 sequences), using BWA v0.6.1-r104(parameters: bwa aln -n 0; bwa samse -n 0) (Li 2009). Utilizing the output SAM file, the number of sequences assigned to a specific carbon fixation gene was determined.
To determine the putative taxonomies of certain carbon fixation processes (rTCA and Calvin cycles), the putative carbon fixation CDS were BLASTp (parameters: -evalue 1e-5 -max_target_seqs 5) against the NCBI RefSeq database (Tatusova 2015). MEGAN4 was then used to determine the last common ancestor (LCA) of the top five hits for each putative carbon fixation CDS (Huson 2011). Utilizing the number of sequences recruited to each putative carbon fixation CDS, CDS with the same taxonomic assignment, up to the Genus level, had their sequence counts combined and was compared to the total number of sequences recruited for that gene.