2.3 Bioinformatics processing to identify OTUs at different
thresholds of genetic similarity
The resulting paired-end reads of the 42 samples were quality filtered
following procedures described by Arribas et al. (2020). Briefly the
processing included quality checking, primer removal, pair merging,
quality filtering, denoising, and clustering each library independently.
We checked raw reads quality with fastqc(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
We trimmed primers using fastx_trimmer option of thefastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to trim
the primer sequences (20 and 26 bases for R1 and R2, respectively).
Then, we processed reads in trimmomatic-0.36(Bolger et al., 2014)
using TRAILING:20 to cut-off bases of the end of a read, if below a
threshold quality of 20. We used R1 and R2 reads to search paired
sequences with pairfq-0.17(Staton, 2013) and we
merged paired reads overlapping with fastq_mergepairs command,-fastq_minovlen 50 and -fastq_maxdiffs 15 inusearch-9.2(Edgar, 2013). We
used quality-filtered (Maxee = 1), dereplicated (-fastx_uniques )
and sorted (-sortbylength ) options to keep only reads of 418 pb
in usearch-10(Edgar & Flyvbjerg,
2015). Surviving reads were denoised to generate zero-radius OTUs
(ZOTUs) with the unoise3 and -minsize 4 commands. ZOTUs
are equivalent to Amplicon Sequence Variants (ASVs; sensu Callahan et
al., 2016), and are proposed to be a set of predicted biological
sequences to be used for direct analysis without the need of OTU
clustering (Callahan
et al., 2017; Edgar, 2016).
We then assigned high-level taxonomic categories in all of the reads
using the lowest common ancestor (LCA) algorithm of MEGAN-6(Huson et al., 2016).
Taxonomic identification of each read was done using BLAST against thenucleotide NCBI nt database (June 06 2018; blastn -outfmt
5 -evalue 0.001). We fed BLAST matches into MEGAN(Huson et al., 2016),
and the taxonomic assignments were used to extract eight ASV datasets,
each one of the following orders: Diptera, Collembola, Arachnida,
Coleoptera, Hymenoptera, Hemiptera, Myriapoda, Lepidoptera. Remaining
sequences were no further considered. We exported the tree (in newick
format), visualised and edited it using figtree-1.4.3 (Rambaut,
2012). Each ASV dataset was aligned in geneious-8.0.2 with MAFFT,
using the FFT-NS-1 algorithm, a scoring matrix of 200/PAM/K=2, GAP open
penalty of 3, and the Translation align option. We reviewed all
sequences for insertions, deletions or stop codons disrupting the
reading frame, which were afterwards excluded.
Subsequently, we generated a community table with read-counts (haplotype
abundance) of each retained ASV for the eight orders by matching ASVs
against the complete collection of reads (i.e., reads before the
dereplicating and denoising steps) using -search_exact command
with usearch -10(Edgar & Flyvbjerg,
2015). Additional filtering according to AVS abundances in community
tables (one ASV per taxa) was performed as described by Arribas et al.,
(2020). Shortly, we first removed from each library those haplotypes
with abundances of four or fewer reads (same criteria than denoising).
Second, we identified haplotypes contributing with less than 1% of the
total reads of the library where they were present, and removed them
from the analysis. The 1% cut-off value has been seen in similar
datasets to remove most of the spurious ASVswhile maximizing the number
of real haplotypes
(Andújar et al., 2020;
Arribas et al., 2020). Lastly, the community tables of filtered
haplotypes were transformed into incidence (presence/absence) data and
used for downstream analyses.
Finally, we defined lineages at different clustering levels for each of
the orders. For this, each of the ASV filtered dataset was used to
generate an UPGMA tree with corrected distances under a F84 model, and
based on this tree, all haplotypes were nested into clusterin levels
(CLs) following the genetic similarity at different thresholds (0.5%,
1.5%, 3%, 5% and 7.5%), plus an additional threshold corresponding
to the result of a species delimitation analyses conducted with the
generalized mixed Yule-coalescent (GMYC) model in R(Pons et al., 2006).
The analyses were performed using vegan (Oksanen et al., 2013),cluster , PMCMR , hier.part , ecodist , andbetapart(Baselga & Orme,
2012).