Material and methods
Experimental design
We sampled 240 M. festivus at 12 sites distributed across three major Amazonian rivers (i.e., Rio Branco, Rio Negro and Rio Solimões) [Fig. 1]. These sites comprise ecosystems with drastically divergent physicochemical parameters: five black water sites (BAR, NEG, CEM, ANA and TEF) and seven white water sites (PIR, SOL, MAN, JAR, JAC, CAT, BRA) [Table 1]. These sites comprise three confluence zones between black and white water sites : (1) Six sampling sites were close to the confluence of two major Amazonian tributaries, the Rio Negro and Rio Solimões (ANA, CEM, CAT, JAC, JAR and MAN); (2) three sites were close to the confluence of the Rio Branco and Rio Negro (BAR, NEG and BRA) and (3) three sites were close to the confluence of the Rio Tefé and Rio Solimões (TEF, SOL and PIR) [Fig. 1].
Field trips were conducted from September to December 2018–2019 during the dry season. Sites coordinates were recorded using the Global Positioning System (GPS) and the fishing perimeters were assessed in the field according to each site’s local geography. We estimated the watercourse distance separating each sampling site using Google Earth pro (Image Landsat / Copernicus, U.S). A multiparameter YSI Professional+Series meter (YSI Inc/Xylem Inc USA) was used to characterize water physicochemical properties (conductivity and pH) on sites. Two litres of water were also sampled at each site 30 cm below the surface, the depth where M. festivus were fished (Pires et al. 2015), to characterize other water parameters in the laboratory. At the laboratory, dissolved organic carbon (DOC) was quantified and characterized, and the concentration of nutrients (NO2-, NO3-, silicates), free ions (Ca, Na, Cl, Mg, K) and 12 metals (Al, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, As, Cd and Pb) were determined according to the method detailed in Sylvain et al. (2019, 2020). All environmental parameters are the means of three measurements and can be found in the supplementary information section.
Twenty fish specimens were collected at each sampling site using a combination of small seine net-fishing and line fishing (fish collection permit number 29837-18). Shortly after collection, fish were euthanized using a classic MS-222 protocol to dissect a fin clip which was kept in NAP conservation buffer to preserve DNA integrity (Camacho‐Sanchez et al. 2013). Samples were frozen at -20 °C right after the dissection and until DNA extraction.
DNA Extraction and Sequencing
A total of 240 fin samples were collected and processed. Genomic DNA was extracted using the QIAGEN DNeasy® Blood and Tissues kit following the manufacturer’s instructions. Genomic DNA was then purified with AMPure beads (Beckman Coulter Genomics), according to the manufacturer’s instructions. Double strand DNA was quantified using the Qubit® 2.0 Fluorometer. GBS libraries and sequencing were done at the IBIS Plateforme d’Analyse Génomique using Pstl and Mspl as restriction enzymes. Barcodes were ligated to the digested DNA fragments and libraries were sequenced using Ion proton™ technology.
SNPs Calling
The raw sequence reads were trimmed with Cutadapt (Martin 2011) in order to remove the adapter sequences and sequence quality was assessed using FastQC. The sequences were extracted and trimmed to 80 bp using process_radtags in STACKS V1.5 (Catchen et al. 2013). After trimming with Cutadapt (Martin 2011) and extracting with process_radtags, samples had an average of 2.91 (sd = 0.91) million reads (n = 231). Nine samples were discarded during the filtration process because of low-quality sequences and/or insufficient coverage. The STACKS programs for de novo SNP discovery were then run with the following filters and parameters. Firstly, we ran Ustacks considering a population size of “p = 1”, a minimum stack depth of “m = 3”, a distance allowed between stacks of “M = 3”, a maximum number of mismatches allowed of “N = 5” and activating the “disable-gapped”, “H” and “deleverage” options. We have tested multiple parameter combinations and these parameters lead to an optimal discovery of SNPs, are robust and lead to a low false discovery rate for de novo SNPs discovery (Paris et al. 2017). Afterward, we allowed a distance of “n = 1” between catalogue loci in Cstacks and applied the following filtration and transformation steps: Sstacks to match samples against the catalogue, tsv2bam to transpose the data so it is stored by locus, gstacks to call variant sites in each population and ran populations using a “p = 4”, a “r = 0.6” and the option “vcf” to produce a variant call format (VCF) file (Rochette and Catchen 2017). These filtration steps led to 428,095 putative SNPs across M. festivus’ genome. SNPs with heterozygosity rates higher than 0.5 were filtered out, which is considered stringent. The resulting VCF file was further filtered using the script “05_filter_vcf_fast.py” from stacks workflow (https://github.com/enormandeau/stacks_workflow) with the following parameters: a minimum allele coverage to keep genotypes of “m = 3”, a minimum percent of genotype data per population of “percent_genotype = 60”, a maximum number of populations that can fail percent genotypes of “max_pop_fail = 0” and a minimum number of samples with rare alleles of “min_mas = 2”. In total, 41,268 SNPs were conserved after these filtration steps, keeping only non-duplicated loci and removing SNPs in linkage. Overall missing data for the dataset was 11.70%. Also, we screened for high relatedness between samples and estimated the mean heterozygosity rate at each site using the function “genetic_diff” from the library vcfR in R (Knaus and Grünwald 2017).
Data analysis
Mesonauta festivus Phylogeographic History
Based on our 41,268 SNPs dataset (n = 231), we used ADMIXTURE, a program that estimates individual ancestry from SNP datasets (Alexander and Lange 2011), to calculate the posterior membership probability of each sample considering two to eight potential genetic clusters in the dataset. Then, we selected the number of biologically significant genetic clusters present in the data based on a combination of cross-validation error values from ADMIXTURE, the goodness of fit (BIC) measures from the “find.cluster” function from Adegenet (Jombart 2008) in R and the analysis of multiple membership probability plots from ADMIXTURE. In addition, pairwise fixation indexes (Fst) were estimated assuming four genetic clusters with the “stamppFst” function from StAMPP (Pembleton et al. 2013) in R. We produced a heatmap to illustrate the linearized Fst/(1-Fst), as they are more adapted to detect scenarios of isolation by distance in linear landscapes like riverscapes (Rousset 1997).
We conducted a multiple linear regression on distance matrices (MRM) (Lichstein 2007), an extension of the partial Mantel analysis. We considered a pairwise “Fst/(1-Fst)” matrix as the dependent variable and used three explanatory matrices: (1) a matrix with pairwise river course distances between sites for isolation by distance; (2) a matrix indicating whether pairs of sites are from the same water type (same = 0, different = 1) for isolation by ecology and (3) a matrix indicating whether pairs of sites are connected by downstream water flow (flow-connected = 0, flow-unconnected = 1) for isolation by unidirectional downstream water currents. The linear MRM was run with 1000 permutations to assess the significance and the power of the three explanatory variables at explaining the genetic dissimilarity between the sampled sites. To visualize the results from the MRM analysis, we produced three simple Mantel tests (Mantel 1967) using the function “mantel.randtest” from the package ade4 in R to look for one by one linkage between the pairwise linearized Fst/(1-Fst) matrix and the three explanatory matrices used in the MRM.
Environmental association study
Water Type Comparison Using Environmental Parameters
At each site, water type was assessed based on visual observations and literature research prior to the environmental parameters’ characterization. Using the library factoextra in R, we calculated a principal component analysis (PCA) on normalized environmental parameters, normalized as deviations from the mean, between sites. It was used to assess which environmental variables were associated with certain water types. The EAS required environmental variables not to be strongly correlated with each other. For this reason, we selected 5 environmental parameters not strongly associated with each others which can differentiate black and white water sites in our study. The selected environmental parameters were the concentration of silicate in suspension (µmol/L), concentration of dissolved organic carbonate (DOC) (mg/L), conductivity (µS/L), productivity (µg of Chla/L) and aluminum concentration (µg/L). Again, using the library factoextra, we produced a biplot to visualize the importance of these environmental parameters in the differentiation of the black and white water types.
Detecting Genetic-Environment Associations
Since EAS models do not accept missing data in their genotype files, we used the ADMIXTURE most probable posterior membership probability result of each sample to impute the missing genotypes according to the procedure described on GitHub: enormandeau/stacks_workflow (Alexander and Lange 2011). We used three proven and recently developed approaches of EAS (de Villemereuil et al. 2014, Rellstab et al. 2015, Capblancq et al. 2018, Forester et al. 2018), which are correcting for neutral genetic variation, to detect associations between our 41,268 SNPs and the diverging environmental parameters of the 5 black and 7 white water sites sampled. For the three EAS methods, we used the imputed SNPs dataset as the dependent variable and used the selected normalized environmental parameters and the water type directly, as a qualitative variable (black water = 1; white water = -1), as the explanatory variables to detect associations between the SNPs and the environment.
First, a constrained ordination redundancy analysis (RDA) (Forester et al. 2016; Legendre & Legendre 2012) was performed to detect covariation between loci and environmental predictors. We performed the RDA considering the genotype matrix as the dependent variable and the environmental parameters as explanatory variables. The covariation and multicollinearity between environmental parameters were verified, with the function “vif.cca” from vegan (Oksanen et al. 2019) in R, to ensure that the variance inflation factor was below 5 for each variable. We verified for the significance of each constrained axis and plotted the RDA to look how the samples at each site are clustering according to their respective environmental parameters. Afterward, we selected the SNPs which were very significantly associated to a given constrained axis by selecting outliers SNPs assuming 3.5 standard deviation cut-offs (two-tailed p-value = 0.00046). These putatively under selection SNPs were then associated to the environmental variable they are the most strongly correlated to. We plotted the selected SNPs to visualize their distribution according to the selected constrained axes.
Second, we used Baypass 2.1 (Gautier 2015), a Bayesian hierarchical model based on the BayEnv model (Coop et al. 2010), to detect linear associations between environmental predictors and genetic markers. We performed the core model of Baypass 2.1 with the following parameters (-npop 12, -gfile imputed SNPs dataset, -efile environmental parameters normalized as deviations from the mean, -burnin 10000 and -pilotlenght 2500). Using a heat map, we compared the resulting covariance matrix of population allele frequencies to the linearized Fst/(1-Fst) heat map to ensure that the neutral genetic variation was adequately computed by Baypass. Afterward, SNPs strongly associated with environmental predictors (eBPis > 1.5) in the “betail_reg_out” file were selected as putatively under selection for a given environmental variable.
Finally, we used a latent factor mixed model (LFMM2) (Caye et al. 2019), a least-squares estimation approach to detect associations between environmental parameters and genotypes. We performed the LFMM using the function “lfmm2” from the library LEA (Frichot and Francois 2015, Caye et al. 2019) in R. We ran the function “lfmm2.test” with the genomic control activated and considering a linear model. We corrected for the neutral genetic structure by considering four latent factors based on the four genetic clusters previously detected. We verified the validity of the model using a histogram of p-values for each explanatory variable, aiming for a flat histogram with a peak near zero. Afterward, we adjusted the p-value associated to each SNP using a Bonferroni correction and selected significant SNPs (p-value < 0.05) as putatively under selection.
These three programs have proven to be robust and correct for the neutral genetic structure (Capblancq et al. 2018; de Villemereuil et al. 2014; Forester et al. 2018; Rellstab et al. 2015). Nonetheless, we decided to reduce the false positive rate by selecting only loci detected by at least two of the three methods using the function “calculate_overlap” from the package VennDiagram (Chen and Boutros 2011) in R (de Villemereuil et al. 2014). Selected markers associated with environmental variables were then extracted from the full-genotype dataset and visualized using a PCA generated with Vegan (Oksanen et al. 2019) in R.