Taxonomic and functional inference of metagenomic reads
Reads were quality-trimmed using Sickle (Joshi & Fass, 2011) with a Phred
quality threshold of >30 and then uploaded to MG-RAST (Keegan et al., 2016).
Functional and taxonomic profiles of reads were generated through
subsystem and best hit classifications using the SEED subsystem, M5NR
(non-redundant protein database) and KEGG, available in MG-RAST (Aziz et
al., 2008; Kanehisa & Goto, 2000; Keegan et al., 2016; Wilke et al.,
2012), with the following parameters: an e-value cutoff of 1 × 10⁻⁵, a
minimum alignment length of 50 bp, and 60% identity. Data generated by MG-RAST
were statistically analyzed using Statistical Analysis of Metagenomic
Profiles (STAMP) software (Parks et al., 2014) and R software (R
Development Core Team), using the packages vegan (Oksanen, 2007)
and ggplot (Wickham, 2011). The p values were calculated using a
two-sided Fisher’s exact test, and the confidence intervals were
calculated using the Newcombe-Wilson method. Statistical comparisons
were performed by grouping the samples according to environmental
temperatures: glaciers, fumaroles up to 80 °C, and the
fumarole at 98 °C. Principal component analysis (PCA)
ordination was performed using level 3 functions of SEED subsystems
and visualized in STAMP. Values were normalized to
relative abundance for comparison of taxonomic composition across
samples. In addition, Spearman correlations were performed to determine
relationships between taxonomic and functional profiles and the
environmental parameters.
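
The pairwise statistics named above can be illustrated with a minimal Python sketch. It is not the STAMP implementation: the function names are hypothetical, the two-sided p value uses the common convention of summing all tables no more probable than the observed one (whether this matches the STAMP setting used here is an assumption), and the interval is the Newcombe-Wilson score interval for a difference between two proportions without continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table that is no more
    probable than the observed one (an assumed convention, see lead-in).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell with fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


def wilson_interval(successes, n, z):
    """Wilson score interval for a single proportion."""
    p = successes / n
    scale = 1 + z * z / n
    center = (p + z * z / (2 * n)) / scale
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / scale
    return center - half, center + half


def newcombe_wilson_ci(x1, n1, x2, n2, alpha=0.05):
    """Newcombe-Wilson interval for the difference p1 - p2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    diff = p1 - p2
    lower = diff - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper
```

In this sketch, `x1` and `x2` would be the numbers of reads assigned to one feature in each of two sample groups, and `n1` and `n2` the total numbers of classified reads in those groups. The exact test enumerates every possible table, so it is intended for illustration rather than for very large read counts.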
To investigate the complexity of community interactions at each sampling
site, we used co-occurrence network analysis. For this, non-random
co-occurrence analyses were performed using the Python module ‘SparCC’
(Friedman & Alm, 2012). A table of hit frequencies assigned at the
genus level was used as input. For each network, we considered only
strong (SparCC > 0.9 or < -0.9) and highly
significant (p < 0.01) correlations between microbial
taxa. The nodes in the reconstructed network represent taxa at the genus
level, whereas the edges represent significant positive or negative
correlations between nodes. The analysis of network complexity was based
on a set of measures, such as the number of nodes and edges, modularity,
the number of communities, average node connectivity, average path
length, diameter, and cumulative degree distribution (Newman, 2003).
Network visualization and property measurements were performed with
Gephi (Bastian et al., 2009).
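
The edge-selection rule described above can be sketched in a few lines of Python. This is an illustration only and does not call SparCC or Gephi: the function names, the dictionary-based input format, and the genus labels in the usage note are hypothetical, and the summary reports average degree, which is a simpler quantity than the measures listed in the text.

```python
def build_network(corr, pvals, r_min=0.9, p_max=0.01):
    """Keep only strong and highly significant correlations as edges.

    corr and pvals map (genus_a, genus_b) pairs to a correlation
    coefficient and its p value, respectively (assumed input format).
    """
    edges = {}
    for pair, r in corr.items():
        if abs(r) > r_min and pvals[pair] < p_max:
            # Sort the pair so that (a, b) and (b, a) give the same key
            edges[tuple(sorted(pair))] = "positive" if r > 0 else "negative"
    return edges


def network_summary(edges):
    """Count nodes and edges and compute the average degree."""
    degree = {}
    for pair in edges:
        for node in pair:
            degree[node] = degree.get(node, 0) + 1
    n_nodes = len(degree)
    avg_degree = 2 * len(edges) / n_nodes if n_nodes else 0.0
    return {"nodes": n_nodes, "edges": len(edges), "average_degree": avg_degree}
```

For example, with placeholder genera "A" to "D", a pair with correlation 0.95 and p = 0.001 would be kept as a positive edge, whereas a pair with correlation 0.97 and p = 0.2 would be discarded because it fails the significance threshold. Genera without any retained edge do not appear as nodes in this sketch.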