2.3 | Population genomics
To identify and characterize genetic clusters within the combined river dataset, and within the river-only datasets, two methods were used for comparison: a Bayesian assignment test and a discriminant analysis of principal components (DAPC). The Bayesian assignment test was performed using the software package fastStructure (Raj et al., 2014). DAPC analyses were run using the adegenet package for R (Jombart, 2008; Jombart et al., 2010).
Large modern SNP datasets can impose substantial computational demands in both time and processing power. The software package fastStructure uses efficient algorithms to fit a Bayesian model that infers the number of genetic clusters (K) within the data and assigns fish to clusters based on individual genotypes, without a priori population definitions (Falush et al., 2007; Hubisz et al., 2009; Pritchard et al., 2000; Raj et al., 2014). The range of potential K values extended to one above the total number of sampling localities within each dataset (Pritchard et al., 2000). Values of K = 1–11 were analyzed for the combined river dataset, and K = 1–6 for each river separately. Additional parameters used for fastStructure included ‘--cv=500’, which enabled cross-validation over 500 test runs. The supplemental program Structure_threader was used to reduce the overall processing time required by fastStructure by automating runs and parallelizing them across multiple CPU threads (Pina-Martins et al., 2017). Structure_threader also automated the identification of the most appropriate K value for each dataset, using the fastStructure chooseK.py script to pinpoint the value of K that maximizes marginal likelihood (Raj et al., 2014). Population memberships and admixture from fastStructure outputs were plotted using Distruct v.2.3 (Chhatre, 2019).
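The K-selection logic described above can be illustrated with a minimal Python sketch. It is not the chooseK.py script itself: the function names are hypothetical, and the marginal likelihood values in the usage example are invented placeholders, not results from this study. The locality counts (10 and 5) are assumed from the stated K ranges, taking the upper bound to be exactly one above the number of sampling localities.

```python
def candidate_k_values(n_localities):
    """Return K = 1 .. n_localities + 1 (one above the number of sampling localities)."""
    return list(range(1, n_localities + 2))


def k_maximizing_marginal_likelihood(marginal_likelihoods):
    """Return the K with the largest marginal likelihood.

    `marginal_likelihoods` maps each tested K to its marginal likelihood,
    e.g. as parsed from per-K log files (the parsing step is not shown here).
    """
    return max(marginal_likelihoods, key=marginal_likelihoods.get)


# Assumed locality counts implied by the stated ranges (K = 1-11 and K = 1-6).
combined_ks = candidate_k_values(10)
single_river_ks = candidate_k_values(5)

# Placeholder values for illustration only; these are NOT study results.
toy_likelihoods = {1: -0.95, 2: -0.90, 3: -0.92}
best_k = k_maximizing_marginal_likelihood(toy_likelihoods)
print(combined_ks, single_river_ks, best_k)
```

In practice the per-K marginal likelihoods would come from the fastStructure output files rather than from a hand-written dictionary.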
A DAPC identifies differences between groups through discriminant functions (Jombart et al., 2010). The sampling localities within each river system were used as the groups in this test. The outcome of a DAPC can be substantially affected by the user-defined number of principal components (PCs) retained. The find.clusters and xvalDapc functions within the R package adegenet provided a procedure for cross-validation and optimization to identify the number of PCs to keep for each dataset (Jombart & Collins, 2015). The number of PCs retained for each analysis was therefore the value with the lowest root mean squared error (RMSE) after 100 iterations at each candidate number of PCs, from 1 to 100 for the combined dataset and from 1 to 50 for each independent river system.
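The selection rule for the number of retained PCs can be sketched in Python as follows. This is an illustration of the lowest-RMSE criterion only, not a reimplementation of xvalDapc: the function names are hypothetical, the RMSE is assumed to be computed on the assignment error (1 minus the proportion of correctly assigned individuals) across cross-validation replicates, and the success rates in the usage example are invented placeholders.

```python
import math


def rmse_of_assignment(success_rates):
    """RMSE of assignment error, where error = 1 - proportion correctly assigned."""
    squared_errors = [(1.0 - s) ** 2 for s in success_rates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_n_pcs(success_by_n_pcs):
    """Return (number of PCs with the lowest RMSE, RMSE per number of PCs).

    `success_by_n_pcs` maps each candidate number of PCs to the list of
    assignment success rates from its cross-validation replicates.
    Ties are resolved in favor of the smaller number of PCs.
    """
    rmse = {n: rmse_of_assignment(rates) for n, rates in success_by_n_pcs.items()}
    best = min(sorted(rmse), key=rmse.get)
    return best, rmse


# Placeholder success rates for illustration only; NOT study results.
toy_success = {1: [0.5, 0.6], 2: [0.9, 0.8], 3: [0.7, 0.7]}
n_pcs, rmse_table = best_n_pcs(toy_success)
print(n_pcs)
```

With 100 replicates per candidate value, each list in the dictionary would hold 100 success rates rather than two.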
Pairwise FST values between all localities within each dataset were calculated as described in Weir & Cockerham (1984) using the hierfstat package for R (Goudet, 2005; R Development Core Team, 2020). Isolation by distance was tested for the Volga-only and Meramec-only datasets using a Mantel test. Pairwise FST values were linearized as FST / (1 − FST) following Rousset (1997), and river distances between localities were used as the geographic distance measure. The Mantel test was conducted with 100,000 replicates in the R package ade4 (Dray & Dufour, 2007). Finally, a nested analysis of molecular variance (AMOVA) was performed for each of the three datasets to further determine the spatial structure of genetic diversity. The AMOVAs were performed using Arlequin v.3.5.2.2 (Excoffier & Lischer, 2010).
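The isolation-by-distance test can be illustrated with a minimal Python sketch of FST linearization followed by a permutation-based Mantel test. It is not the ade4 implementation: the function names are hypothetical, a one-sided test on the Pearson correlation is assumed, and the matrices in the usage example are invented placeholders for four localities, not study data.

```python
import math
import random


def linearize_fst(fst):
    """Linearize FST as FST / (1 - FST)."""
    return fst / (1.0 - fst)


def upper_triangle(matrix, order):
    """Upper-triangle entries of `matrix` after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_test(fst_matrix, dist_matrix, n_permutations=999, seed=0):
    """One-sided Mantel test between linearized FST and distance matrices."""
    identity = list(range(len(fst_matrix)))
    genetic = [linearize_fst(v) for v in upper_triangle(fst_matrix, identity)]
    observed = pearson(genetic, upper_triangle(dist_matrix, identity))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_permutations):
        order = identity[:]
        rng.shuffle(order)  # permute localities, i.e. rows and columns together
        if pearson(genetic, upper_triangle(dist_matrix, order)) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Placeholder matrices for illustration only; NOT study data.
toy_dist = [[0, 10, 20, 30], [10, 0, 10, 20], [20, 10, 0, 10], [30, 20, 10, 0]]
toy_fst = [[0, 0.01, 0.02, 0.03], [0.01, 0, 0.01, 0.02],
           [0.02, 0.01, 0, 0.01], [0.03, 0.02, 0.01, 0]]
r, p = mantel_test(toy_fst, toy_dist)
print(r, p)
```

Permuting localities rather than individual matrix cells preserves the dependence among pairwise values that share a locality, which is the reason a Mantel test is used in place of an ordinary correlation test.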