Taxonomic assignment
Separate from the sample-based data decontamination procedure described above, taxonomic assignment for each metabarcoding sequence required evaluating the full set of BLAST hits for each ASV using a custom R script (R Core Team, 2019). The goal of the R script was to obtain the highest taxonomic resolution for each sequence while accounting for all BLAST hits above the 96% minimum identity required by the blastn query. Species-level identification was accepted only if the ASV sequence matched the database reference sequence at >98% identity (as in Alberdi et al., 2018), and only if no BLAST hit within 2% identity of the top hit matched a different species. When BLAST results for a given ASV violated either of these rules, the next taxonomic level (i.e., common ancestor) was tested using the same criteria, and so on until a consensus taxonomic rank was obtained among matches within 2% identity of the top hit. For example, when an ASV had a significant hit to only a single species, that species was assigned unless the sequence match was <98% identity, in which case the ASV was assigned at the genus level. However, when an ASV had significant hits to multiple taxa, the common ancestor of the BLAST hits within 2% identity of the top hit determined whether that sequence could be attributed to a species, genus, or family, or whether the sequence provided too little informative variation for a high-resolution assignment (code available on GitHub).
Decontaminated ASV and read count data were merged with taxonomic information from ranked and filtered BLAST hits. Multiple ASVs within a locus that matched the same taxon at more than one taxonomic level (e.g., one ASV identifies the family Clupeidae and another matches the genus Clupea) were merged to retain the highest-resolution assignment (in this example, the genus Clupea) for each taxon within each replicate/locus. We reasoned that both sequences most likely originated from the same fish.
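One way to picture this merging step is the Python sketch below. It is illustrative only: the record format (replicate, locus, lineage tuple ordered from coarse to fine, read count) and the function names are hypothetical. It also makes two assumptions the text does not state: that the reads of the coarser assignment are added to the finer one, and that a coarser assignment is left unmerged when it could belong to more than one finer taxon.

```python
from collections import defaultdict


def is_proper_prefix(a, b):
    """True if lineage `a` is a strict ancestor of lineage `b`."""
    return len(a) < len(b) and b[:len(a)] == a


def merge_nested_assignments(records):
    """Fold coarser assignments into the single finer assignment they contain.

    records: iterable of (replicate, locus, lineage, reads), where lineage is
    a tuple ordered coarse to fine, e.g. ("Clupeidae", "Clupea").
    Returns {(replicate, locus): {lineage: reads}}.
    """
    groups = defaultdict(lambda: defaultdict(int))
    for replicate, locus, lineage, reads in records:
        groups[(replicate, locus)][tuple(lineage)] += reads

    merged = {}
    for key, counts in groups.items():
        # Leaves are the finest assignments: not an ancestor of any other.
        leaves = [l for l in counts
                  if not any(is_proper_prefix(l, o) for o in counts)]
        out = {l: counts[l] for l in leaves}
        for lineage, reads in counts.items():
            if lineage in out:
                continue
            targets = [l for l in leaves if is_proper_prefix(lineage, l)]
            if len(targets) == 1:
                out[targets[0]] += reads   # assumption: reads are summed
            else:
                out[lineage] = reads       # ambiguous: keep coarse rank
        merged[key] = out
    return merged


example = [
    ("rep1", "locus_A", ("Clupeidae",), 10),          # family-level ASV
    ("rep1", "locus_A", ("Clupeidae", "Clupea"), 90),  # genus-level ASV
    ("rep1", "locus_A", ("Gadidae", "Gadus"), 5),
]
result = merge_nested_assignments(example)
```

In this example the family-level Clupeidae reads are folded into the genus Clupea within `rep1`/`locus_A`, while the unrelated Gadus assignment is untouched.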
Finally, taxonomic assignments were used to compare the performance of the individual markers and metabarcoding loci in recovering species added to the vouchered reference and full reference DNA pools, and to determine the optimal combination of markers for maximizing identification of reference taxa to the species level.
The portfolio of complementary markers was identified by ranking markers using an accumulation curve: the marker that recovered the greatest number of species from the FR was ranked first, followed by the marker that contributed the greatest number of additional species, and so on until the curve saturated (Fig. 1). The minimal panel of primer pairs that captured the full species diversity in the DNA pools was used to analyse the experimental feeds and examine quantitative relationships between relative tissue abundance and sequencing read proportions in heterogeneous mixtures.
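The ranking procedure amounts to a greedy accumulation over markers, which the following Python sketch illustrates. The function name `rank_markers`, the marker names, and the species labels are hypothetical placeholders, not data from this study. Ties are broken alphabetically by marker name, which is an assumption of the sketch.

```python
def rank_markers(recovered):
    """Greedily rank markers by the number of new species each one adds.

    recovered: dict mapping marker name -> set of species it identified.
    Returns a list of (marker, species_added, cumulative_species), stopping
    once no remaining marker adds a new species (the curve saturates).
    """
    remaining = dict(recovered)
    covered = set()
    order = []
    while remaining:
        # max() returns the first maximal element, so sorting the names
        # first gives an alphabetical tie-break.
        best = max(sorted(remaining),
                   key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break
        covered |= remaining.pop(best)
        order.append((best, gain, len(covered)))
    return order


# Placeholder input: which species each hypothetical marker recovered.
example_recovered = {
    "marker_A": {"sp1", "sp2", "sp3"},
    "marker_B": {"sp3", "sp4"},
    "marker_C": {"sp1", "sp2"},
}
ranking = rank_markers(example_recovered)
```

Here `marker_A` is ranked first, `marker_B` contributes one additional species, and `marker_C` is dropped because it adds nothing new. A greedy ranking of this kind is simple to read off an accumulation curve, but in general it is not guaranteed to find the smallest possible panel.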