Taxonomic and functional inference of metagenomic reads
Reads were quality trimmed using Sickle (Joshi & Fass, 2011) with a Phred score > 30 and then uploaded to MG-RAST (Keegan et al., 2016). Functional and taxonomic profiles of the reads were generated through subsystem and best-hit classifications against the SEED subsystems, M5NR (non-redundant protein database) and KEGG databases available in MG-RAST (Aziz et al., 2008; Kanehisa & Goto, 2000; Keegan et al., 2016; Wilke et al., 2012), with the following parameters: an e-value cutoff of 1 × 10⁻⁵, a minimum alignment length of 50 bp, and a minimum identity of 60%. Data generated by MG-RAST were statistically analyzed using the Statistical Analysis of Metagenomic Profiles (STAMP) software (Parks et al., 2014) and R (R Development Core Team) with the packages vegan (Oksanen, 2007) and ggplot2 (Wickham, 2011). The p values were calculated using the two-sided Fisher’s exact test, and confidence intervals were calculated using the Newcombe-Wilson method. Statistical comparisons were performed by grouping the samples according to environmental temperature: glaciers, fumaroles up to 80 °C, and the fumarole at 98 °C. Principal component analysis (PCA) was performed on level 3 functions of the SEED subsystems and visualized in STAMP. Values were normalized to relative abundance to compare taxonomic composition across samples. In addition, Spearman correlations were calculated to determine the relationships between the taxonomic and functional profiles and the environmental parameters.
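The two-sample statistics named above can be illustrated with a minimal Python sketch. It is not STAMP's implementation, which may differ in detail: the function names and the 2x2 table layout (hits assigned to a feature versus all other hits, in each of two groups) are assumptions made for illustration. The exact test enumerates every possible table, so it is only practical for small counts.

```python
from math import comb, sqrt
from statistics import NormalDist


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def wilson_interval(x, n, conf=0.95):
    """Wilson score interval for a single proportion x / n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    scale = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / scale
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / scale
    return centre - half, centre + half


def newcombe_wilson_diff(x1, n1, x2, n2, conf=0.95):
    """Difference in proportions with a Newcombe-Wilson confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, conf)
    l2, u2 = wilson_interval(x2, n2, conf)
    diff = p1 - p2
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper


# Toy counts, not data from this study.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
diff, lower, upper = newcombe_wilson_diff(30, 100, 10, 100)
```

Here the Newcombe-Wilson interval combines the Wilson score limits of the two proportions, which keeps the interval for the difference within [-1, 1] even when one proportion is close to 0 or 1.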
To investigate the complexity of community interactions at each sampling site, we used co-occurrence network analysis. Non-random co-occurrence analyses were performed using the Python module ‘SparCC’ (Friedman & Alm, 2012) on a table of hit frequencies assigned at the genus level. For each network, we considered only strong (SparCC correlation > 0.9 or < -0.9) and highly significant (p < 0.01) correlations between microbial taxa. The nodes in the reconstructed networks represent taxa at the genus level, whereas the edges represent significant positive or negative correlations between nodes. Network complexity was assessed with a set of measures: the number of nodes and edges, modularity, the number of communities, average node connectivity, average path length, diameter, and cumulative degree distribution (Newman, 2003). Networks were visualized and their properties calculated with Gephi (Bastian et al., 2009).
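The edge-filtering step and some of the simpler network measures can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SparCC or Gephi workflow itself: the input format (tuples of two taxa, a correlation, and a p value), the function names, and the placeholder genus labels are all hypothetical, and average degree is used here as one possible reading of node connectivity. Modularity and community detection are left out.

```python
from collections import deque


def filter_edges(correlations, r_min=0.9, p_max=0.01):
    """Keep strong (|r| > r_min) and highly significant (p < p_max) pairs."""
    return [(a, b, r) for a, b, r, p in correlations
            if abs(r) > r_min and p < p_max]


def build_graph(edges):
    """Undirected adjacency sets; nodes are genera, edges are correlations."""
    graph = {}
    for a, b, _ in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def network_summary(graph):
    """Node and edge counts, average degree, average path length, diameter."""
    n_nodes = len(graph)
    n_edges = sum(len(nbs) for nbs in graph.values()) // 2
    lengths = []
    for source in graph:
        dist = bfs_distances(graph, source)
        # Count each connected, unordered pair of nodes once.
        lengths.extend(d for target, d in dist.items() if target > source)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "average_degree": 2 * n_edges / n_nodes if n_nodes else 0.0,
        "average_path_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "diameter": max(lengths, default=0),
    }


# Toy correlations (taxon, taxon, correlation, p value), not study data.
toy = [
    ("genus_A", "genus_B", 0.95, 0.001),
    ("genus_B", "genus_C", -0.93, 0.005),
    ("genus_C", "genus_D", 0.97, 0.002),
    ("genus_A", "genus_D", 0.60, 0.001),  # too weak, dropped
    ("genus_B", "genus_D", 0.92, 0.030),  # not significant, dropped
]
summary = network_summary(build_graph(filter_edges(toy)))
```

Path lengths are averaged over connected pairs only, so a network split into several components does not produce infinite distances.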