1 | Introduction
Aquatic biomonitoring has drastically changed in the last decade with
the advent of High Throughput-Sequencing (HTS) and the substantial cost
reduction of sequencing (Thomsen & Willerslev, 2015; Valentini et al.,
2016; Deiner et al., 2017; Lobo et al., 2017; Ruppert, Kline & Rahman,
2019; Wang et al., 2019). Environmental DNA (eDNA) amplicon-based
metabarcoding is increasingly being used for rapid and cost-effective
community characterization, and is now often promoted and is being
gradually integrated into routine monitoring programs (Leese et al.,
2016; Danovaro et al., 2016; Pawlowski et al., 2018; Pilliod et al.,
2019).
Metagenomics, herein defined as the study of the entire genetic material
recovered from environmental samples, has also received considerable
attention in the last few years (Bourlat et al., 2013; Grossart et al.,
2020). While amplicon-based eDNA metabarcoding provides information on
which organisms are present, metagenomics enables insights into the
functions they possess. This is particularly relevant when
microorganisms are being used as indicator organisms. For organisms such
as bacteria, little is known about the ecology of the vast majority of
species. As such, it is their metabolic capability rather than their
identity that is usually of greatest interest, and likely to provide
more information about current environmental conditions. Taxonomic and
functional profiles can respond differently to biogeography, abiotic
environmental variables (e.g. organic content, metal concentration) and
community processes and interactions. As such, they can exhibit
different level of stochasticity and temporality, and provide
complementary information that may increase our understanding of the
mechanisms behind community turnover (Barberan et al., 2012; Louca,
Parfrey & Doebeli, 2016; Hornick & Buschmann, 2018; Cordier et al.,
2020). Having both taxonomic and functional information also enables the
computation of functional redundancy within the community, which may
also help assess resilience (Escalas et al., 2019).
While metagenomics may eventually replace metabarcoding for whole
community assessment, it’s mainstream use is still limited by its
relatively low sample throughput and cost-efficiency, and heavy
computational and data management requirements (Bowman & Ducklow, 2015;
Nagpal, Haque & Mande, 2016; Breitwieser, Lu & Salzberg, 2018). To
circumvent these issues, several programs have been developed to infer
metabolic profiles from eDNA 16S ribosomal RNA (rRNA) gene data (e.g.
Functional Annotation of Prokaryotic Taxa [Faprotax]
(Louca, Parfrey & Doebeli, 2016), Piphillin (Iwai et al.,
2016), Vikodak (Nagpal, Haque & Mande, 2016), Prediction by
phylogenetIC plAcement [Paprica] (Bowman & Ducklow, 2015),
Phylogenetic Investigation of Communities by Reconstruction of
Unobserved States [Picrust2; Douglas et al., 2019)
and Tax4Fun2 (Wemheuer et al., 2018). The last three methods
use a hidden-state prediction (HSP) approach (Zaneveld & Thurber, 2014)
where genomic content is inferred according to the position of the
genome in a reference phylogenetic tree. These methods have been shown
to provide functional profiles that correlate, with a varying degree of
success, to metagenomics and metabolomics profiles (e.g. Bowman &
Ducklow, 2015; Douglas et al., 2019; Sun, Jones, & Fodor, 2020;
Wemheuer et al., 2018). While not as accurate as metagenomics functional
analyses, these HSP methods can provide more complete metabolic profiles
as they do not require high sequencing depth to assign functions (e.g.
pathways of rare taxa can still be assigned; Langille et al., 2013), and
may be useful in situations where metagenomics would be prohibitively
expensive, such as in broad microbial routine monitoring surveys.
Metabolic inference methods have been evaluated and used in a large
variety of studies, for example in clinical trials (Millares et al.,
2015), oyster aquaculture (Arfken et al., 2017), aquatic urban systems
(Wang et al., 2018), acid mine drainages (Aguinaga et al., 2018),
subterranean estuaries (Hong et al., 2019), gut microbiota
(Pacheco-Sandoval et al., 2019), marine biofilms (Salerno et al., 2018),
and coral reef associated microbiomes (Pearman et al., 2019). However,
there has been limited use in the context of benthic marine monitoring
surveys (Hornick & Buschmann, 2018; Laroche et al., 2018; Cordier,
2020). Before this application can be routinely applied in such
situation, it is essential to evaluate its performance. In particular,
the accuracy of functional inference methods is strongly influenced by
the completeness of the reference databases and by the genetic
plasticity of some taxonomic groups (Bowman & Ducklow, 2015). For
example, certain functions, especially those involving few genes, tend
to occur at shallow phylogenetic depth (Martiny et al., 2015) and can be
difficult to correctly predict.
The aim of this study was to evaluate the performance of three metabolic
inference methods, Paprica, Picrust2 and
Tax4Fun2, against metagenomics and environmental data, in the
context of salmon farm benthic surveys. In particular, we aimed to: 1)
compare predictions and abundance correlations between 16S rRNA
amplicon-based metabarcoding functional inference approaches (hereafter
referred to as HSP methods) and shotgun metagenomics functional
profiling, 2) contrast the taxonomic and functional microbial diversity
recovered from metabarcoding and metagenomics, and 3) assess and compare
the sensitivity of functional communities derived from HSP methods and
metagenomics with respect to microbial turnover and in correlation with
environmental data.
2 | Materials and
methods
2.1 | Sample collection
Benthic sediment samples (depths of 32-85 m) were collected at four
large scale Atlantic salmon (Salmo salar ) farms, two located in a
semi-exposed coastal region of mid-Norway (farms: FRØ and SMØ) and two
inside fjords in northern Norway (farms: NOR, STO) (Fig. 1). At FRØ, six
samples were collected: three biological replicates located next to the
pen (0 m), and three reference (control) replicates located 1,200 m away
from the fish pens. The objective of this sampling design was to compare
bacterial communities between impacted and non-impacted sediments. To
test HSP methods across different geographic settings within Norway,
samples were also collected next to the pen at the NOR and STO farms
(one each) in northern Norway, and two samples at SMØ (Southern-Norway),
one next to the pen and one located at a reference site located 7,920 m
away from the farm. Approximately 5 g of the top (1 cm) sediment layer,
collected with a van Veen grab sampler (surface area 0.1
m2), was subsampled per grab with a sterile spatula
and stored at -20°C until further processing.
2.2 | Physico-chemical and macrofaunal
analyses
A complimentary suite of environmental parameters was obtained from
parallel studies. The prevalence of three terrestrial fatty acids (oleic
acid, 18:1n-9; linoleic acid, 18:2n-6; and α-linolenic acid, 18:3n-3) in
the sediments, which indicate fish-feed-derived organic matter (White et
al., 2017), were assessed by Folch lipid extraction and direct
methylation as described in Woodcock et al. (2019). The organic and
inorganic carbon content of the sediment was determined by drying the
sediment at 40°C for 48 h followed by combustion at 450°C for 2 h
(LOI450). Measures of benthic respiratory fluxes, including ammonium
(NH4), total carbon dioxide (TCO2), and
oxygen (O2) were obtained from, and following the
methods described in Keeley et al. (2019). Additionally, the macrofaunal
communities were obtained from and characterized using the methods
described in Keeley et al. (2019) and in Keeley et al. (In
prep .).
2.3 | DNA extraction
Sediment samples were homogenized with a sterile stainless-steel micro
spatula (soaked in 50% bleach solution - commercial bleach diluted with
double-distilled water [ddH2O]) for a minimum of 5
min and rinsed with ddH2O between each sample),
subsampled (0.25 g), and processed with the DNeasy PowerSoil kit
(QIAGEN, California, USA) following the manufacturer’s protocol. DNA
purity was measured with a Nanophotometer (Implen, Munich, Germany),
integrity assessed on 1.5% agarose gels, and quantity measured with a
Qubit® Fluorometer (Life Technologies).
To assess potential cross-contamination, extraction blanks (sediment
replaced by ddH2O) were included. All sample handling
and extraction steps took place in a dedicated DNA laboratory that was
decontaminated with 50% bleach solution prior to DNA extraction.
2.4 | Targeted 16S rRNA library preparation
DNA extracts were set at equimolar concentration (5 ng/μL) with
ddH2O for a total volume of 25 μL per sample, stabilized
with DNAstable (Biomatrica, California, USA) following the
manufacturer’s protocol, and dry-shipped to the Cawthron Institute
(Nelson, NZ) for 16S rRNA library preparation. Upon arrival, DNA
extracts were rehydrated with 25 μL of ddH2O, and a
segment of the V3-V4 region of the 16S ribosomal RNA gene (approximately
450 base pairs [bp]) was PCR amplified using the forward
S-D-Bact-0341-b-S-17: 5’-CCT ACG GGN GGC WGC AG-3’ and reverse
S-D-Bact-0785-a-A-21: 5’- GAC TAC HVG GGT ATC TAA TCC-3’ primers from
Klindworth et al. (2013), modified to include
IlluminaTM overhang adaptors.
PCR reactions consisted of 22 μL of AmpliTaq Gold 360 PCR Master Mix
(Life Technologies), 8 μL of ddH2O, 1 μL of each primer
(10 μM, IDT, Iowa, USA), 5 μL of GC enhancer (Life Technologies), and 2
μL of template DNA (5 ng/μL). The reaction cycling conditions were 94°C
for 3 min, followed by 30 cycles of 94°C for 30 s, 52°C for 30 s, 72°C
for 1 min, with a final extension step at 72°C for 5 min. Each PCR
included a negative control (no template sample) to ensure the absence
of cross-contamination.
Amplicon purification and normalization (2 ng/uL) were performed with
the SequalPrepTM Normalization plates (Invitrogen,
California, USA) following the manufacturer’s instructions, and
submitted to New Zealand Genomics Ltd (Auckland, New Zealand) for
indexing with the NexteraTM DNA library Prep Kit
(Illumina, California, USA), pooling and paired-end (2 × 250 bp)
sequencing on a IlluminaTM MiSeq. One blank sample
(ddH2O) was included prior to indexing and sequencing to
control for potential contamination. Sequences are available from the
NCBI Sequence Read Archive (SRA) under project number PRJNA661323.
2.5 | Metagenomics library
preparation
DNA extracts were set at equimolar concentration (8 ng/μL) with 10 Mm
Tris and sent on dry ice to the Norwegian Sequencing Center (NSC, Oslo)
for library preparation with the NexteraTM DNA Flex
Tagmentation kit (Illumina, California, USA) following the
manufacturer’s protocol. After indexing, samples were pooled and
sequenced on ½ Novaseq SP flow cell (Illumina, California, USA) with a 2
x 150 bp paired-end protocol. One blank sample (ddH2O)
was included prior to indexing and sequencing to assess potential
contamination. Sequences are available from the NCBI Sequence Read
Archive (SRA) under project number PRJNA661323.
2.6 | Bioinformatic analysis of 16S rRNA
data
Primers from the demultiplexed fastq files were removed with
cutadapt (version 2.6; Martin, 2011) and reads quality
filtered, denoised, merged and chimera filtered with the DADA2
R program (version 1.14; Callahan et al., 2016). Prior to
quality filtering, reads were truncated at 226 and 220 bp on the 5’ end
to remove the lower quality section and reduce the number of reads lost
during quality trimming. Quality filtering and denoising were performed
using the default parameters, and merged using a perfect minimum overlap
of 10 bp. Chimera removal was performed using the consensus method where
sequences found to be chimeric in the majority of samples (default value
= 90 %) are discarded. Sequences found in negative controls, including
DNA extraction, PCR, indexing and sequencing blanks were examined and
subsequently removed from across all samples. Sequences unclassified at
kingdom level or not identified as bacteria were discarded.
Additionally, rare amplicon sequence variants (ASVs; less than 10 reads
across the entire study) were removed from the dataset. The resulting
sequences were used for taxonomic and functional profiling using three
pipelines: Paprica (version 0.5.2; Bowman & Ducklow, 2015),
Picrust2 (version 2.2.0_b; Douglas et al., 2019) and
Tax4Fun2 (version 1.1.4; Wemheuer et al., 2018). These methods were
chosen for their popularity in the scientific literature, their
compatibility with large datasets, and their reliance on KO (KEGG
ortholog numbers) and EC (Enzyme commission numbers), which can be
easily assessed against metagenomics results obtained from
Humann2 (Franzosa et al., 2018), our chosen
functional profiling methodology. Prior to the taxonomic assignment and
metabolic inference, sequencing depth per sample was visualized with the
“rarecurve” function of the “vegan” R package (version
2.5.6; Oksanen et al., 2019) to ensure that all samples had sufficient
sequencing depth to recover most of the diversity (Fig. S1). The default
parameters implemented within each method were used to keep the analysis
as simple as possible.
2.7 | Bioinformatic analysis of metagenomics
data
The metagenomics pipeline used is based on the fully automated workflow
of the Humann2 software (version 2.8.2; Franzosa et
al., 2018). Reads were first pre-processed with KneadData
(version 0.7.4; https://bitbucket.org/biobakery/kneaddatae) to
perform both quality filtering with Trimmomatic (version 0.39;
Bolger, Lohse, & Usadel, 2014), and screening of undesired reads
(herein phix, human and salmon DNA) with Bowtie2 (version
2.3.5.1; Langmead & Salzberg, 2012). Profiling of taxa was performed
with MetaPhlAn2 (version 2.7.8; Truong et al., 2015) and
results used to construct a sample-specific database from functionally
annotated pangenomes (referred to as the ChocoPhlAn database).
Humann2 then performed a nucleotide-level mapping with
Bowtie2 of all reads against the custom database, followed by a
translated search with Diamond (version 0.8.36.98; Buchfink,
Xie, & Huson, 2015) for those that did not align. Reads that remained
unaligned were subjected to an additional translated search against the
UniRef50 protein database (Suzek et al., 2015). The gene family
abundance table was then converted to a KO and EC tables using the
“Humann2_regroup_table” function and the
uniref50_ko and uniref50_level4ec groups, respectively.
2.8 | Data analysis and
statistics
Taxonomic and functional richness differences between metabarcoding and
metagenomics data were visualized with box and bar plots using the
“ggplot2” R package (version 3.3.1; Wickham, 2016).
Using presence/absence data, the prediction of pathways from HSP methods
(herein Paprica, Picrust2 and
Tax4Fun2) was tested against the metagenomics profiles using
the “caret” R package (version 6.0.86; Kuhn, 2020) and the
“confusionMatrix” function, and visualized with the “alluvial” R
package (version 0.1.2; Bojanowski & Edwards, 2016).
Spearman correlations of functions across all samples and per samples
between the HSP methods and the metagenomics data were assessed using
the “stats” R package (version 3.6.1; R Core Team, n.d.) and the
“cor” function. For this analysis, EC and KO abundances were
transformed to the centered-log ratio with the “clr” function of the
“composition” R package (version 1.40.4; van den Boogaart,
Tolosana-Delgado, & Bren, 2020) and only functions shared between
inference methods and metagenomics were maintained. Because several ECs
and KOs co-occur within pathways and genomes, correlation of functions
between HSP and metagenomic samples can be naturally high, even between
completely unrelated samples (Douglas et al., 2019; Sun, Jones & Fodor,
2020). To take this dependency into account, we added a randomized
dataset referred to as “null expectation” for each HSP method. Based
on the original ASV abundance table, the permaswap function of the
“vegan” R package (parameters: times = 1, burnin = 20000, thin = 500,
mtype = ”count”, shuffle = ”both”) was first used to create a dataframe
with permuted samples and ASVs. This table was then inputted in to each
HSP method to obtain “null expectation” datasets of functional
inferences. Differences between results of the actual and “null
expectation” data were tested with Welch two sample t-tests.
Correspondence of ASV and functional communities between HSP methods
with metagenomics was assessed with a procrustes test using the
“protest” function (symmetric analysis with 9,999 permutations and
scores = “sites”) of the “vegan” R package (version 2.5.6; Oksanen
et al., 2019). The correspondence of the ASV and functional communities
derived from HSP and metagenomics with macrofaunal communities
(transformed with the Wisconsin method implemented in “vegan”) as well
as physico-chemical data (scaled with the rda function of the “vegan”)
was also assessed with procrustes tests using the same parameters.
Physico-chemical data were only completely available for the FRØ
locality, therefore only samples from this site were kept for the latter
analysis. In addition to the procrustes tests, the sensitivity of each
dataset towards fish farming was assessed using permutational analyses
of variance (PERMANOVA; Anderson, 2005) between samples collected at the
0 m and ≥ 1200 m from the pen. For the procrustes and PERMANOVA
analyses, two methodologies were evaluated: 1) Euclidean distance
matrices computed from centered-log ratio transformed abundances, as
suggested in Gloor, Macklaim, Pawlowsky-Glahn, & Egozcue (2017) for
compositional data, and 2) Jaccard distance matrices for
presence/absence data. Multivariate homogeneity of groups dispersions
analyses between distance categories (0 m and ≥ 1200 m) were performed
with the “betadisper” function of the “vegan” R package.
Compared to reference sites, benthic environments in proximity to fish
farm activities are typically characterized by higher concentrations of
organic matter and nutrients, especially phosphorus and nitrogen
(Buschmann et al., 2006; Wang et al., 2012), which can lead to eutrophic
conditions and anaerobic microbial degradation (e.g. sulphate reduction
and methanogenesis; Valdemarsen, Kristensen & Holmer, 2009). To explore
this, the response of pathways associated to the nitrogen and sulfur
cycle between near and far-field sites, and between metagenomics and
HSP-based pathways profiles were compared. Pathways were clustered to
their parent class based on the MetaCyc database
(https://metacyc.org/) and only classes involved in nitrogen
(fixation, ammonification, nitrification, denitrification and
degradation) and sulfur cycle (oxidation, reduction and degradation)
were maintained. Groups of pathways that associated to more than one
parent class were filtered out. The response of the remaining classes
between pen and reference sites was analyzed using centered-log ratio
transformed data and the ‘lm’ function of the ‘stats’ R package and
visualized with barplots using the ‘ggplot2’ R package. Only
Paprica and Picrust2 provide pathways data
derived from the MetaCyc database and are similar to those of
Humann2, therefore only these two HSP methods were
assessed. In addition, functional groups determined by the
Faprotax methodology (version 1.2.1; Louca, Parfrey & Doebeli,
2016), where ecologically relevant groups are assigned to ASVs based on
available literature from cultured strains, was performed and
differential abundance assessed with Ancom2 (version 2.1; Kaul
et al., 2017) between pen and reference sites in order to compare
results with metagenomics and HSP data.