2.2 | DNA sequencing and SNP genotyping
A reduced representation genomic dataset using RADseq was generated to genotype all individuals using genome-wide SNPs. RADseq libraries were prepared using a protocol modified from Etter et al. (2011), as specified in Gamble et al. (2015). Genomic DNA was extracted from ethanol preserved pectoral fin clips using a DNeasy Blood and Tissue Kit (Qiagen). Approximately 1 μg of DNA from each sample was digested at 37℃ for 2 hours using the high-fidelity SbfI restriction enzyme (New England Biolabs). A two-adapter process was utilized to create DNA libraries from the digested samples, with 10-15 individually barcoded samples allocated to each Illumina library pool. P1 adapters with unique barcodes were ligated to the specific SbfI overhang sequence exposed during the digestion process. Samples were then pooled, sheared using a sequence of cyclic sonication (Diagenode Bioruptor), and size-selected into approximately 300-400 base pair (bp) lengths using magnetic beads in PEG/NaCl buffer (Rohland & Reich, 2012). The libraries were blunt-end repaired and dA-tailed. P2 adapters were then ligated to each pooled library to provide unique Illumina identifiers and additional barcode sequences, necessary for the Illumina sequencing process (Andrews et al., 2016). All libraries were amplified via PCR using Q5 DNA polymerase (New England Biolabs) for 14 cycles, cleaned, and size-selected an additional time using a GeneRead Size Selection Kit (Qiagen). Genomic libraries were quantified using a Qubit fluorometer and pooled for Illumina sequencing. DNA fragment sizes were verified using a Bioanalyzer assay (Agilent) and the pooled libraries were sequenced with paired-end 75 and 100 bp reads on Illumina HiSeq 4000 (75 bp paired-end) and NovaSeq 6000 (100 bp paired-end) platforms at the Genomics Division of the Iowa Institute of Human Genetics, University of Iowa.
Raw Illumina sequence data were de-multiplexed and filtered by their adapter sequences using the process_radtags command inSTACKS v.2.54 (Catchen et al., 2013). Two mismatches were allowed within the adapter sequences ‘–adapter_mm 2 ’ and any reads marked as failures by Illumina’s purity filter were discarded by using the ‘–filter_illumina ’ options. The program flags, ‘-c ’, ‘-q ’, and ‘-r ’ were utilized to remove data with uncalled bases, low phred33 quality scores, and recover and repair barcode segments (Rochette et al., 2019). Read lengths for all samples were trimmed to a 70 bp length to provide data continuity between both Illumina sequencing platforms and to remove any low-quality bases from the 5’ end of reads (Catchen et al., 2011). Files output fromprocess_radtags were then concatenated to a single file named for each respective sample (Rochette & Catchen, 2017).
Due to the lack of an available reference genome for E. caeruleum , the denovo_map.pl script (STACKS ) was run to create a consensus sequence catalog, generate SNP loci, and formulate candidate alleles for individual fish (Catchen et al., 2011). Parameters for the assembly process were optimized using testing procedures described in Paris et al. (2017). The minimum depth of coverage required to create a stack, a set of short-reads which match exactly, was set to a value of 3 ‘-m 3 ’. The maximum distance allowed between stacks was set to 3 nucleotides ‘-M 3 ’ while the mismatches allowed between sample loci during catalog sequence assembly was set at 2 ‘-n 2 ’. The ‘–rm-pcr-duplicates ’ program flag was also applied to reduce the effect of PCR amplification bias by removing pairs of reads with the same insert length.
Three datasets were generated following the de novo assembly process, one for both river systems combined, and two to analyze the Volga and Meramec river data independently. For each of the datasets, the ‘-r -0.80 ’ flag was implemented for the populationscommand (STACKS ) as an optimization target and filtering step to ensure only SNP loci shared by a minimum of 80% of all individuals in a population are retained (Rochette & Catchen, 2017). The ‘–write-single-snp ’ program flag was additionally applied to restrict the analysis to the first SNP per locus and lessen the potential for error created by linkage disequilibrium. Finally, a minor allele frequency cutoff of 1% was applied with ‘-min-maf 0.01 ’ to filter and remove potential outlier loci. Supplementary program options for the populations function were used concurrently to generate values for differentiation among sampling localities in expected and observed levels of heterozygosity, private alleles, nucleotide diversity, and inbreeding coefficients (Rochette & Catchen, 2017). The software package PGDSpider 2.1.1.5 was used for additional conversion and formatting of the populations output (Lischer & Excoffier, 2012).