We notice that it was not possible to achieve the detection of original COI haplotypes in reads originating from natural eDNA samples. The same was true for some individual eDNA samples. Combined with the small number of common OTUs between the artificial community samples, this may indicate rather low coverage, which can be noted as a major drawback of the study (see Table 2). It is possible that primer bias may also affect the results (Shu et al., 2021). In addition, all of the artificial samples were noticeably contaminated with the reads from the species Oncorhynchus keta contained in the tank located in the same room. But it appears that the water simply contains much more microbial DNA (in our case it is the plylum Bacillariophyta) than the commercial fish and invertebrate species of interest (Collins et al., 2021), and the reason lies in fundamental differences in primer design approaches for target species detection and metabarcoding surveys (Freeland, 2017). Typically, in such studies at least 30,000 reads per sample are required to reach a plateau at the number of OTUs (Dully et al., 2021). At the same time, there is an evidence that even for individually designed markers for target species, the number of detectable haplotypes using eDNA can be reduced relative to their actual number (Adams et al., 2022).
As recent studies show that the environmental DNA can provide detailed information on a species-specific haplotype diversity, although currently only a few studies have successfully used this approach to obtain population-genetic data (Sigsgaard et al., 2016; Elbrecht et al., 2018; Tsuji et al., 2020a,b; Andres et al., 2021; Adams et al., 202). Factors such as sequencing errors that lead to false positives, as well as limitations in the understanding of the processes that affect the rate of environmental DNA production affect the validity of the results (Eble et al., 2020).
A study of haplotype diversity among Plecoglossus altivelis altivelis fishes using the mitochondrial D-loop marker showed high accuracy in determining variation by using ASV noise reduction algorithms, which eliminated false positives (Tsuji et al., 2019). Further studies in this area showed that the haplotype diversity estimated from invasive screening was lower compared to estimates obtained using environmental DNA at all screening sites. Non-invasive and invasive DNA sampling were found to be prone to overestimate and underestimate intraspecific genetic diversity, respectively. Similarly, the results showed variation depending on which of the noise reduction algorithms was used (Tsuji et al.,2020a).
Another work also indicates that the analysis of environmental DNA yields a result close to the estimated intraspecific diversity of the entire population. In addition, it has been observed that the difference in the number of haplotypes between estimates based on DNA extracted from tissues and environmental DNA is most likely due to the difference in the sample coverage (Tsuji et al.,2020b).
In the mentioned studies, the D-loop was used as a marker, which is a non-coding region with a level of variability higher than that of most other mitochondrial and some nuclear markers. At the same time, it is known that markers encoding a protein, which includes COI, have a lower level of mutations and, in addition, may tend to increase the detection accuracy due to noise suppression.
When comparing ASVs and OTUs isolated from individual specimens using specifically a local reference, we found an interesting pattern. Thus, the lack of detectability of ASVs is observed in those species which exhibit some additional OTUs. Accordingly, it makes sense to assume that they (the sequences forming the additional OTUs) can compete for primer hybridization during PCR in comparison to the original ASVs, which can ultimately result in a global underestimation of the genetic diversity of live organisms from eDNA based on the use of this marker (in particular). Previously, these two measures were considered in parallel, and the possibility of distorting one indicator at the expense of the other was not discussed, especially the reasons for such a bias.
To clarify the nature of the additional OTUs and their origin, the relevant assumptions have been made, considering them as: 1) sequencing errors (artifacts); 2) derivatives from duplicated regions of the mitochondrial genome; 3) cryptic (or overlooked) species diversity; and 4) NUMTs or pseudogenes.
Since the origin of identical sequencing errors with high coverage in different samples is unlikely, this version has to be rejected. Currently, there is no information on the complete sequences of the mitochondrial genome of P. latirostris, but a closely related species of this genus, P. borealis, does not exhibit such structural features (Viker et al., 2006). In fish, duplication of the mitochondrial genome fragments is generally quite rare phenomenon (however, see the flatfish Li et al., 2015), and species close to H. octogrammus also do not exhibit such a features (Ji et al., 2020). The presence of cryptic diversity cannot be excluded for P. latirostris not only because of the lack of a complete reference for the genus, but also because of the "eating" of several specimens in the artificial communities during the experiment (see materials and methods). Accordingly, at least a part of the OTUs of this species can be attributed to unidentified, but existing haplotypes in the artificial communities (Fig. 4, Table 3 – OTU 32). At the same time, comparison with the data from the sites of the original species description (Fig. 4, id. DRR) indicates that the intraspecific variation of P. latirostris on the basis of the marker used is homogeneous. Hence, we can conclude that OTUs diverging from the reference ones by at least 0.023 (Fig. 4) should be considered as originated from a different type of phenomenon. They, like such OTUs of H. octogrammus species, are highly likely to be nuclear copies of mitochondrial sequences or NUMTS, which is a type of pseudogenes (Hazkani-Covo et al., 2010; Marshall and Parson, 2021). The divergent OTUs (Fig. 3) are characterized either by nonsynonymous nucleotide substitutions (P. latirostris) or a shift in the reading frame (H. octogrammus), due to which they can be attributed to pseudogenes, having previously excluded other interpretations of their origin. In general, it is rather difficult to differentiate NUMTS from true sequences of the mitochondrial DNA (Hazkani-Covo et al., 2010; Nugent et al., 2020; Porter, Hajibabaei, 2021), and it is much easier to work in this respect with model organisms, for which all kinds of references are available (Marshall, Parson, 2021), as well as try to prevent accumulation of NUMTs at the experimental stage (Wang et al., 2019). Organismal DNA samples, meanwhile, do not tend to detect many NUMTS during Sanger sequencing because of their small number in the amplicon pool to be generated (Hebert et al., 2004; Schultz, Hebert, 2021) as opposed to eDNA. In our case (with the exception of OTUs with 2-nucleotide deletions carrying a reading frame shift), we can be satisfied with the exclusion method only and only if a pre-collected complete barcode DNA reference library for the species in question is available. At this point, we can expect that almost all phylogeographic works based on environmental DNA metabarcoding is subject to a kind of "survivorship bias," where only what has passed selection by primers and sequencing coverage is analyzed. It cannot be excluded that our study also carries this error. The processing of the obtained OTUs using the lulu feature gave, basically, the expected results. For example, we showed that sufficiently divergent NUMTS are not eliminated, but remain as a separate OTUs (Figure 4, Table S3), especially in the presence of a large number of reads for them. The problem is that there is no proper database of pseudogenes, and we again return to the "survivorship bias". In this respect, we agree with our colleagues who point to the necessity of maintaining a pseudogenes database in addition to the well-curated barcode reference library (Schultz and Hebert, 2021).
Table 4. The ASVs found with local reference database indicating the number of reads