Bioinformatics
Sequencing reads were assigned to samples using the unique dual indexes and to markers based on the primer sequence. Sequencing adapters and locus-specific markers were removed from paired-end reads using the linked-adapter function in cutadapt (Martin, 2011). Trimmed reads were imported into qiime2 (Bolyen et al., 2019) and data were then denoised, in a process including estimating sequencing error rates from the data, correcting errors, dereplicating sequences, removing chimeras, and then merging overlapping paired-end reads using the dada2 denoise-paired function with default parameters (Callahan et al., 2016). Reads were then truncated at +/- 10% of the expected amplicon-size for each locus to remove low quality bases at the trailing ends of sequencing reads (Callahan et al., 2016). The denoising algorithm derives an error model from the sequencing data and aims to correct errors in order to minimize single-nucleotide sequencing errors and therefore, preserve true species-level variation as amplicon sequencing variants (ASV; Callahan, McMurdie, & Holmes, 2017). Because of the short fragment length of the minibarcodes, often just a small number of base-pair differences distinguish species within the same genus (Edgar, 2018), and thus, ASVs have the potential to provide better species-level taxonomic resolution than other analysis methods (e.g., operational taxonomic units, OTU; Callahan, McMurdie, & Holmes, 2017).
Despite similar DNA inputs to PCR reactions, sequencing read depth varied several orders of magnitude among markers for reference DNA pool samples (SI, Fig. S2). To better evaluate marker performance, we subsampled 100,000 reads per marker per sample for each occurrence that included >100,000 reads (two markers, 16Svar and L2513/H2714 had <100,000 reads). Subsampling was done on the fastq files using the sample function from seqtk (https://github.com/lh3/seqtk) and then output (subsampled) data were reprocessed in qiime2 and dada2, as described above.
Data for each marker was analysed separately, and the resulting tables included read counts for every ASV sequence occurring in each sample. ASV sequence data were exported from qiime2 in FASTA format and then queried against a custom metazoan database derived from the NCBI Nucleotide sequence database. Separate databases for each gene (12S, 16S, COI, 18S, and 28S) were created by querying NCBI using Entrez. The search was performed using the gene name and a filter that included only animals and sequences matching the NCBI taxon ID for Metazoa (33208). Gene names were offered to Entrez in many variants (e.g. “COI”, “COX1”, “cytochrome oxidase subunit I”, “cytochrome oxidase subunit 1”) to account for variation in naming scheme (additional details and code on GitHub). For mitochondrial genes (12S, 16S, and COI), queries were also restricted to mitochondrial sequences and filtered by length to exclude sequences substantially shorter than any of the targeted amplicons (<50 bp) while including whole mitochondria of all sizes (filter <100,000 bp; largest known metazoan mitochondrial genome is 80,923 bp; Stampar et al., 2019). For nuclear genes (18S and 28S), the same length filters were used to exclude short sequences (<50 bp) and long genomic scaffolds (<100,000 bp). Search terms “scaffold” and “shotgun” were excluded from search queries. These filtered Entrez queries were then downloaded separately in FASTA format. By using only NCBI nucleotide data for genes that corresponded to our markers, and only searching the custom database that corresponded to the appropriate gene for each primer set, we were able to reduce overall computation time and potential off-target BLAST hits. Downloaded sequences were further filtered to remove environmental, unverified, uncultured, and protein database sequences and then configured into a BLAST database with makeblastdb from the BLAST+ suite (Camacho et al., 2009; code available on GitHub).
FASTA output from qiime2 was queried against the reference databases using BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) command line with the blastn program using 96% minimum identity, 95% query coverage; e-value = 10, reward = 2 and penalty = -3, penalty to open a gap = 5 and to extend a gap = 2, and with no limit to the number of hits per sequence (no culling limit). The BLAST query might have included some liberal matches, but all BLAST hits were further filtered and refined in subsequent steps (the highest e-value for data passing filter = 1.54 e-38). GenBank accession numbers were used to access the taxonomic lineage for each hit using a custom bash script which utilizes NCBI’s accession2taxid files and TaxonKit (Shen & Xiong, 2019) to extract seven taxonomic ranks (i.e., domain, phylum, class, order, family, genus, species) for each BLAST hit (code on GitHub).
GNU Parallel was used throughout the bioinformatic pipeline to run multiple jobs simultaneously while maintaining the optimal number of threads for each program (Tange, 2011).