igure 1. Map of the area where the animals were collected and the experimental work were carried out. The locality of Vityaz Bay is marked with a circle. The asterisk indicates the Vostok Bay. The square marks the location where the animals were placed in the aquarium to form the mock communities.
1.3 The DNA isolation and individual genotyping of the hydrobionts
The extraction of total DNA from the fixed tissue was performed using K-SORB-100 kit (Syntol, Russia). The isolated DNA was then used to amplify a 313 bp (Geller et al. 2013; Leray et al., 2013; Wangensteen et al. 2018) and 650 bp (Ward et al., 2005) long COI gene mitochondrial fragments. For the latter a combination of FishF1 and FishR1 primers was suitable for all 3 species. The PCR mixture consisted of 5x Taq Red buffer (5 µl), dNTPS (0.5 µl, 10 mM), a pair of primers (0.12 µl, 0.05 mM each), Taq polymerase (0.25 µl, 1.25 units per reaction), 1 µl of DNA solution (20-100 ng), and distilled water to a final reaction volume of 25 µl. The PCR thermal algorithm started with a pre-denaturation for 10 min at 95 ⁰C. This was preceded by 35 cycles according to the following scheme: denaturation at 94 ºC for 1 min, annealing at 45 ºC for 1 min, and 1 min elongation at 72 ºC with a final elongation for 10 min. To check the results of amplification, we performed the electrophoresis of the fragments in 1% agarose gel followed by exposure in ethidium bromide solution and visualization under UV. The successfully amplified samples were purified by alcohol precipitation and the DNA pellet was dissolved in deionized water. A forward and reverse sequencing reactions were performed using purified amplicons together with corresponding primers used in PCR step according to the BrilliantDye™ Terminator Cycle Sequencing Kit manufacturer's instructions (NimaGen, Netherlands). The capillary electrophoresis of the fragments produced by cycle sequencing was performed on Applied Biosystems Genetic Analyzer 3500. Consensus sequences were assembled from the obtained chromatograms using Geneious program (Kearse et al. 2012). These sequences were used to search for the homologous ones in NCBI via the BLAST algorithm (Boratyn et al., 2012). The alignment together with the reference sequences selected from BLAST results and a read frame search using the translation tool was performed in MEGA 7.0 software (Kumar et al. 2016). Sequence matrices were then generated for each species separately and haplotypic variation was assessed in DnaSP v.6 (Rozas et al., 2017). Based on the results of hydrobiont genotyping (2 haplotypes for P. dybowskii and P. latirostris, as well as 3 haplotypes for H. octogrammus, see Table 1), individual aquatic DNA samples corresponding to the detected haplotypes of the 313 bp COI fragment were selected for the further work, one for each haplotype. After the detection of OTUs identified from the local database, it was necessary to clarify their relationships to the original haplotypes. In addition, the COI barcode sequences from H. octogrammus and P. dybowskii species were used to verify the species identity of the haplotypes obtained. Since no reference data are available for P. latirostris, we assembled COI fragments from reads of 4 transcriptomes of this species (Kawahara-Miki et al., 2011) using NOVOPlasty (Dierckxsens et al., 2016). Based on the combined matrix, a phylogenetic NJ tree was constructed in the MEGA 7.0 program (Figure 4). The robustness of the topology was assessed using 1,000 replicates of the nonparametric bootstrap test.
1.4 The DNA extraction from syringe filters, COI Leray fragment amplification, sequencing, and reads processing
The extraction of DNA from the syringe filters was performed using M-Sorb-OOM kit (Syntol, Russia) with modification of the manufacturer's protocol, according to which at the initial stage the lysis buffer was heated to 65ºC and passed through the filter tip in the opposite direction to the filtration (backflushing method, after Kesberg and Schleheck, 2013), draining the entire volume of the resulting liquid into a clean test tube. Based on the isolated DNA, a 313-bp long COI fragment was amplified (Geller et al. 2013; Leray et al., 2013; Wangensteen et al. 2018) with three replicates per sample. For each sample, we used a pair of primers with an individual 7-nucleotide tag at the 5′-end (doubly-tagged approach) that were developed in ecotag (Boyer et al. 2016). The negative PCR control was also performed by using separate pair of tagged primers. The reaction mixture included 10 μl of AmpliTaq Gold 360 Master Mix, 0.5 μl of each (forward and reverse) primer (10 μM), 0.16 μl of bovine serum albumin, 10 ng of DNA and deionized water to the final volume of 20 μl. The PCR thermal cycling profile included preheating at 95ºC for 10 min with subsequent 35 cycles according to the following scheme: 1 min at 94ºC, 1 min at 45ºC and 1 min at 72ºC. The final elongation was at 72ºC for 5 min. The results of amplification were checked in the same way as described above. The amplicons were purified using Cleanup S-Cap (Evrogen, Russia) and normalized (see (Elbrecht and Steinke 2019)) before pooling. The volume of control reaction was taken as an average of the obtained volume of normalized samples. The normalized amplicons were then combined with the control and sequenced at Novogene (Tianjin, China). The library was created using a PCR-free NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, England) and sequenced on an Illumina high-throughput sequencer by a 250 bp paired-end sequencing technology.
The resulting reads were processed according to the Begum metabarcoding pipeline (Yang et al., 2021; Zepeda-Mendoza et al., 2016). After removing adapters and pre-assessing the quality of the reads with fastqc, we used Spades (Bankevich et al., 2012) to correct possible errors. Paired-end reads were merged into consensus reads using PandaSeq (Masella et al., 2012). The reads were then demultiplexed and filtered using Begum. Clustering was performed in Sumaclust (Mercier et al., 2013) with parameters -t 0.98 and -R 0.85. Determination of the taxonomic identity of the generated operational taxonomic units (OTUs) was performed using BLAST+ 2.12.0 (Camacho et al., 2009). In addition, a local reference base was generated to identify the target OTUs and ZOTUs (ASVs), which consisted of 7 nucleotide sequences of previously found unique haplotypes (Table 1). The taxonomic information was then summarized in two separate tables (derived from global and local reference databases, see Table S1 and Table 3, respectively) based on the output from the MEGAN community edition program (Huson et al. 2016). The results derived from the local reference were further condensed using the lulu package (Frøslev et al., 2017) to test the effect of preserving non-original OTUs. Since the most acceptable (Antich et al., 2021) existing method for identifying ZOTUs in vsearch (Edgar, 2016) was found to be extremely sensitive to coverage, after obtaining consensus sequences we had to use several sequential searches using the grep command in the shell of Ubuntu to sort the reads based on sample tags followed by the search for the haplotypes obtained during Sanger sequencing (see section 1.3) in the sample-sorted reads content.
2. Assessing the genetic variation of COI fragments in the population datasets retrieved from GenBank
The most common groups of multicellular organisms (phylum Mollusca, phylum Echinodermata, subphylum Crustacea of Arthropoda phylum, class Polychaeta of Annelida phylum, and class Actinopterygii of Chordata phylum). Taxonomic-based searches were performed in the popset database of the NCBI GenBank resource (Benson et al. 2018) among the sequence sets for population-genetic studies. An important reason for the selection of these groups was their good coverage in GenBank, as well as the presence of homologous Leray COI marker fragments in NCBI. A total of 83 datasets were selected. Nine to 20 sets for each group (Supplement A, Table S2). There were at least 17 sequences in each dataset. Reduction of the retrieved sequences to the Leray fragment length was performed through their alignment together with reference datasets, which included 313 bp COI fragments with a retrieved reading frame as well as several complete COI gene sequences from each group. The sequence set then joined in the MEGA program together with the reference sequence set and was translated given the mitochondrial code corresponding to the taxon and aligned using ClustalW (Larkin et al., 2007) based on the Protein Weight Matrix BLOSUM with default parameters. If the alignment was successful (the Leray region was located within a fragment of the examined COI sequences, between 130 and 236 amino acid sites of the complete translated gene sequence), matrix was truncated according to the Leray fragment length. The haplotype function of the pegas package (Paradis, 2010) was used to estimate the haplotypic variability of all the data sets obtained in this manner (both the original and trimmed ones). The number of population clusters in each dataset was estimated using the Geneland program (Gulliot et al., 2005; Gulliot, 2008) without prior information on the geographic or other subdivision of the samples. The calculation was based only on the sites with SNPs. They were extracted from the sequence matrices using the SNP-sites program (Page et al., 2016). Next, we used the console version of the PGDSpider program (Lischer and Excoffier, 2012) to convert the sets into the format recommended by the Geneland authors (Guillot et al., 2011). During the preliminary stage, 2 independent MCMC runs were performed with the total number of generations equal to 100000 and 500000 and the number of generations accounted for the burn-in step equal to 200 and 250 (discarded after the search), respectively. Sampling from the parameters space was performed every 100 generations. The maximum number of populations simulated during the search was set to 20 with the correlated frequency model accounted. For those datasets that showed differences in the number of determined populations between the first two runs, or did not form a clear peak in the distribution of the number of populations along the chain after burn-in step and showed density below 0.5, an additional run was conducted using 1500000 generations to exclude possible undersampling during the search. Scripts providing simultaneous formatting as well as the analysis of all data sets are given in the supplementary material (Supplement B). Statistical processing and data visualization were performed in R (R Core Team, 2021).