2.2 | DNA sequencing and SNP genotyping
A reduced representation genomic dataset using RADseq was generated to
genotype all individuals using genome-wide SNPs. RADseq libraries were
prepared using a protocol modified from Etter et al. (2011), as
specified in Gamble et al. (2015). Genomic DNA was extracted from
ethanol preserved pectoral fin clips using a DNeasy Blood and Tissue Kit
(Qiagen). Approximately 1 μg of DNA from each sample was digested at 37℃
for 2 hours using the high-fidelity SbfI restriction enzyme (New
England Biolabs). A two-adapter process was utilized to create DNA
libraries from the digested samples, with 10-15 individually barcoded
samples allocated to each Illumina library pool. P1 adapters with unique
barcodes were ligated to the specific SbfI overhang sequence
exposed during the digestion process. Samples were then pooled, sheared
using a sequence of cyclic sonication (Diagenode Bioruptor), and
size-selected into approximately 300-400 base pair (bp) lengths using
magnetic beads in PEG/NaCl buffer (Rohland & Reich, 2012). The
libraries were blunt-end repaired and dA-tailed. P2 adapters were then
ligated to each pooled library to provide unique Illumina identifiers
and additional barcode sequences, necessary for the Illumina sequencing
process (Andrews et al., 2016). All libraries were amplified via PCR
using Q5 DNA polymerase (New England Biolabs) for 14 cycles, cleaned,
and size-selected an additional time using a GeneRead Size Selection Kit
(Qiagen). Genomic libraries were quantified using a Qubit fluorometer
and pooled for Illumina sequencing. DNA fragment sizes were verified
using a Bioanalyzer assay (Agilent) and the pooled libraries were
sequenced with paired-end 75 and 100 bp reads on Illumina HiSeq 4000 (75
bp paired-end) and NovaSeq 6000 (100 bp paired-end) platforms at the
Genomics Division of the Iowa Institute of Human Genetics, University of
Iowa.
Raw Illumina sequence data were de-multiplexed and filtered by their
adapter sequences using the process_radtags command inSTACKS v.2.54 (Catchen et al., 2013). Two mismatches were allowed
within the adapter sequences ‘–adapter_mm 2 ’ and any reads
marked as failures by Illumina’s purity filter were discarded by using
the ‘–filter_illumina ’ options. The program flags,
‘-c ’, ‘-q ’, and ‘-r ’ were utilized to remove data
with uncalled bases, low phred33 quality scores, and recover and repair
barcode segments (Rochette et al., 2019). Read lengths for all samples
were trimmed to a 70 bp length to provide data continuity between both
Illumina sequencing platforms and to remove any low-quality bases from
the 5’ end of reads (Catchen et al., 2011). Files output fromprocess_radtags were then concatenated to a single file named
for each respective sample (Rochette & Catchen, 2017).
Due to the lack of an available reference genome for E.
caeruleum , the denovo_map.pl script (STACKS ) was run to
create a consensus sequence catalog, generate SNP loci, and formulate
candidate alleles for individual fish (Catchen et al., 2011). Parameters
for the assembly process were optimized using testing procedures
described in Paris et al. (2017). The minimum depth of coverage required
to create a stack, a set of short-reads which match exactly, was set to
a value of 3 ‘-m 3 ’. The maximum distance allowed between stacks
was set to 3 nucleotides ‘-M 3 ’ while the mismatches allowed
between sample loci during catalog sequence assembly was set at 2
‘-n 2 ’. The ‘–rm-pcr-duplicates ’ program flag was also
applied to reduce the effect of PCR amplification bias by removing pairs
of reads with the same insert length.
Three datasets were generated following the de novo assembly
process, one for both river systems combined, and two to analyze the
Volga and Meramec river data independently. For each of the datasets,
the ‘-r -0.80 ’ flag was implemented for the populationscommand (STACKS ) as an optimization target and filtering step to
ensure only SNP loci shared by a minimum of 80% of all individuals in a
population are retained (Rochette & Catchen, 2017). The
‘–write-single-snp ’ program flag was additionally applied to
restrict the analysis to the first SNP per locus and lessen the
potential for error created by linkage disequilibrium. Finally, a minor
allele frequency cutoff of 1% was applied with ‘-min-maf 0.01 ’
to filter and remove potential outlier loci. Supplementary program
options for the populations function were used concurrently to
generate values for differentiation among sampling localities in
expected and observed levels of heterozygosity, private alleles,
nucleotide diversity, and inbreeding coefficients (Rochette & Catchen,
2017). The software package PGDSpider 2.1.1.5 was used for
additional conversion and formatting of the populations output
(Lischer & Excoffier, 2012).