Locus complementarity and taxonomic resolution
Species-level identification in the DNA pools proved more challenging than either genus or family identification, even though all taxa were represented in one or more of the reference databases for the target genes (Fig. S4). Species-level assignment is more desirable, but also more difficult to obtain because barcoding genes often contain insufficient variation to confidently distinguish congeneric species. This limitation is particularly acute for minibarcodes, which capture a smaller section of larger barcoding genes. Furthermore, reference databases (e.g., GenBank, BOLD) are less complete at the species level than for genera or families (i.e., some databases lack data for individual species, but a much higher proportion of genera and families are represented), and databases are less accurate for species- and genus-level identification (Leray et al., 2019; Locatelli et al., 2020). For these reasons, biodiversity studies may choose to assign data to family, order, or class rather than species in order to capture greater taxonomic breadth (e.g., Djurhuus et al., 2020; Leray & Knowlton, 2015). Our results affirm that such an approach would accurately detect 100% of families present in our reference DNA pool.
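The trade-off between taxonomic rank and detection rate described above can be illustrated with a minimal sketch. The taxa, assignments, and the function name `detection_rate` below are hypothetical and are not drawn from our data or pipeline; the example only shows how the proportion of expected taxa recovered can be tallied at each rank when some assignments stop at genus or family.

```python
# Hypothetical mock community: (family, genus, species) for each expected taxon.
# These names are illustrative only and are not the study's DNA pool.
EXPECTED = [
    ("Salmonidae", "Salmo", "Salmo salar"),
    ("Salmonidae", "Salmo", "Salmo trutta"),
    ("Gadidae", "Gadus", "Gadus morhua"),
    ("Clupeidae", "Clupea", "Clupea harengus"),
]

# Hypothetical assignments from a single marker.
# None marks a rank that could not be resolved for that sequence.
OBSERVED = [
    ("Salmonidae", "Salmo", None),          # congeners not distinguishable
    ("Gadidae", "Gadus", "Gadus morhua"),   # resolved to species
    ("Clupeidae", None, None),              # resolved to family only
]

RANKS = ("family", "genus", "species")


def detection_rate(expected, observed, rank):
    """Return the fraction of expected names at `rank` that were observed."""
    i = RANKS.index(rank)
    expected_names = {taxon[i] for taxon in expected}
    observed_names = {taxon[i] for taxon in observed if taxon[i] is not None}
    return len(expected_names & observed_names) / len(expected_names)


rates = {rank: detection_rate(EXPECTED, OBSERVED, rank) for rank in RANKS}
for rank in RANKS:
    print(f"{rank}: {rates[rank]:.2f}")
```

In this toy case every family is recovered, while only some genera and species are, mirroring the general pattern that coarser ranks are easier to detect completely.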
Notably, two of the top-performing markers amplified adjacent, non-overlapping regions of the COI gene. COI markers benefit from the most complete reference database of the genes we tested (SI, Fig. S4), which is consistent with prior studies of fish tissues (Devloo-Delva et al., 2019). The strategy of including multiple markers for the same gene has been applied often in plant barcoding as well as for 18S rRNA in animals (e.g., Coghlan, Shafer, & Freeland, 2020; Machida & Knowlton, 2012). Fewer studies show the added benefit of multiple COI markers (but see Corse et al., 2019; Shokralla, Hellberg, Handy, King, & Hajibabaei, 2015; Valdez-Moreno et al., 2019). Including two COI minibarcode markers in our portfolio hedges against the limitations of amplifying degraded samples while leveraging the robust COI reference data for diverse marine and freshwater taxa.
Despite the popularity of 12S for metabarcoding marine and freshwater fishes, and the commensurate abundance of reference data (Miya et al., 2015; Miya, Gotoh, & Sado, 2020), our top 12S primer set identified fewer reference taxa than the top COI and 16S markers. However, the 12S locus contributed more unique species-level identifications (those not recovered by other genes; Fig. 2), hinting at the overall utility of this region for barcoding fish to the species level. Coupled with results from Zhang et al. (2020), in which 12S markers identified the largest number of fishes from waterbodies in Beijing, our results reinforce the expectation that optimal markers may differ across habitats and taxonomic groups, even within fishes.
Markers targeting specific taxonomic groups (sharks, plankton, crustaceans, and cephalopods) provided no additional resolution for reference taxa in the DNA pools, because our representatives from these taxa were already detected with our top-performing teleost fish primers. Surprisingly, COI markers designed for sharks and plankton performed nearly as well on teleost fish as the best universal fish COI markers. In contrast, crustacean and cephalopod markers had little utility outside their targeted taxonomic groups. Admittedly, we had few representatives of these groups in the DNA pools to test the potential increased resolution of taxon-specific markers, so our results are not conclusive, but they suggest that some markers can perform well outside their immediate target group (Figs. 2, 3).
Both 18S markers yielded a higher proportion of false positives and of contamination in the extraction blanks and PCR negatives than other gene regions, possibly due to a mismatch between the resolution of the 18S barcoding region and the species composition of the DNA pools (e.g., 18S may be better for identifying diverse groups to class or order and consequently picks up more bacterial contamination; SI, Fig. S4). A similar explanation, non-specific amplification, may account for the limited number of target taxa amplified by the lone 28S locus. Interestingly, a prior study noted that non-specific amplification in COI markers impaired eDNA analyses for marine and freshwater fishes (Collins et al., 2019), yet that study did not test either of our top-performing COI markers. This illustrates both the large number of universal fish COI markers available and that non-specific amplification resulting in false detections can vary among markers within a single barcoding gene and among applications (i.e., tissue mixture metabarcoding or eDNA).
Unfortunately, three markers that have amplified well in other studies (e.g., Polanco et al., 2021; Pont et al., 2018) received so few sequencing reads that we were unable to retain them in our analysis. The three markers that dropped out were also those that, based on preliminary data (agarose gel bands), we chose to amplify in multiplex PCRs (each paired with one additional marker). In each of the three multiplexes, one marker performed well and the other did not. Thus, our exploration of multiplex reactions revealed challenges that would require taking amplified products through to sequencing in order to confirm that both markers receive a comparable number of reads (De Barba et al., 2014). Despite the validation steps necessary for effective multiplexing, doing so with complementary markers that amplify different barcoding genes could ultimately yield a more efficient laboratory workflow.
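As an illustration of the kind of post-sequencing check implied here, the following minimal sketch flags markers that receive a disproportionately small share of reads within a multiplex. The function name `flag_dropouts`, the marker labels, the read counts, and the 10% threshold are all assumptions made for the example; they are not values or tools used in this study.

```python
def flag_dropouts(read_counts, min_share=0.1):
    """Return markers whose share of the multiplex's reads is below min_share.

    read_counts: dict mapping marker name -> number of sequencing reads.
    min_share:   hypothetical cutoff; an appropriate value would need to be
                 chosen for a given sequencing depth and study design.
    """
    total = sum(read_counts.values())
    if total == 0:
        # No reads at all: every marker in the multiplex failed.
        return sorted(read_counts)
    return sorted(m for m, n in read_counts.items() if n / total < min_share)


# Hypothetical read counts for two two-marker multiplexes.
unbalanced = {"marker_A": 95000, "marker_B": 500}
balanced = {"marker_C": 48000, "marker_D": 52000}

print(flag_dropouts(unbalanced))
print(flag_dropouts(balanced))
```

A check of this sort can only be run after sequencing, which is the practical point: gel bands alone would not reveal the imbalance.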
Taken together, our results underscore the advantages of using an optimized portfolio of barcoding markers (similar to results described by Shaw et al., 2016; Zhang, Zhao, & Yao, 2020), yet also reveal that adding markers to a portfolio without testing for complementarity can increase project costs and laboratory effort without improving detection or identification. Further, additional markers can increase the number of false positive observations, through nontarget amplification or mismatches with reference data, and these issues can be more acute when researchers seek high-resolution species identification from broad biodiversity surveys. For studies aiming to quantify biodiversity based on sequence variation patterns, researchers should also be aware of potential nontarget amplification of nuclear mitochondrial pseudogenes (numts), and can use available software (e.g., metaMATE; Andújar et al., 2021) to remove these sources of error.
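One simple way to test for complementarity before expanding a portfolio is a greedy selection that repeatedly adds the marker contributing the most taxa not yet detected. The sketch below is an illustrative heuristic (greedy set cover), not the procedure used in this study; the function name `greedy_portfolio`, the marker labels, and the detection sets are hypothetical.

```python
def greedy_portfolio(detections):
    """Greedily order markers by the number of new taxa each one adds.

    detections: dict mapping marker name -> set of taxa it detected.
    Returns a list of (marker, new_taxa_added) tuples; markers that add
    nothing beyond those already chosen are left out.
    """
    remaining = set().union(*detections.values())
    chosen = []
    while remaining:
        # Iterate in sorted order so that ties are broken reproducibly.
        best = max(sorted(detections), key=lambda m: len(detections[m] & remaining))
        gain = detections[best] & remaining
        if not gain:
            break
        chosen.append((best, len(gain)))
        remaining -= gain
    return chosen


# Hypothetical detections for four markers (taxa labelled sp1..sp7).
detections = {
    "COI_a": {"sp1", "sp2", "sp3", "sp4"},
    "COI_b": {"sp3", "sp4", "sp5"},
    "16S": {"sp1", "sp2", "sp3"},
    "12S": {"sp2", "sp6", "sp7"},
}

portfolio = greedy_portfolio(detections)
print(portfolio)
```

In this toy example the hypothetical 16S marker is never selected because everything it detects is already covered, which is the situation in which an extra marker adds cost without adding detections. A greedy ordering is not guaranteed to find the smallest possible portfolio, and it ignores false positives, so it is best treated as a first-pass screen.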