2.4 Bioinformatics pipeline and SNP filtering
Raw sequence data were processed using Stacks v2.62 (Catchen et al., 2013) and aligned to the genome with BWA v0.7.17-r1188 (Li & Durbin, 2009). Full details of the bioinformatics pipeline, which produced a variant call format (VCF) file containing 45,488 variants for SNP filtering in R v4.0.3 (R Core team 2020) are provided in Supplementary file S1. We filtered genotyped variants using the “SNPfiltR” v1.0.0 package (DeRaad, 2022) based on (i) minimum read depth (≥ 5), (ii) genotype quality (≥ 20), (iii) maximum read depth (≤ 137), and (iv) allele balance ratio (0.2 – 0.8). Then, using a custom R script, we filtered SNPs based on (i) the level of missing data (< 5%); (ii) minor allele count (MAC ≥ 3), (iii) observed heterozygosity (< 0.6), and (iv) linkage disequilibrium (correlation < 0.5 among loci within 500,000 bp).
To ensure that relationships between individuals could be accurately inferred from the data, we used these SNPs and samples to construct a hierarchical clustering dendrogram based on genetic distance, with visual examination of the dendrogram confirming that all 24 replicates paired closely together on long branches (Figure S1). We also made a higher-level bootstrapped dendrogram by using genetic distances among sampling localities instead of individuals (Figure S2). The percentage difference between called genotypes of technical replicates was also used to confirm that genotyping error rates were low after filtering (mean 99.91% ± 0.005% SE similarity between replicates). We therefore removed one of each replicate pair from all further analyses.
We used “tess3r” (Caye et al. 2016, 2018) to perform a genome scan for loci under selection, using the Bejamini-Hochberg algorithm (Benjamini & Hochberg, 1995), with a false discovery rate of 1 in 10,000 to correct for issues associated with multiple testing. Because this method identified zero candidate loci under selection, we also used the gl.outflank function in “dartR” v2.0.4 to implement the OutFLANK method (Whitlock & Letterhos 2015) to infer the distribution of FST for loci unlikely to be strongly affected by spatially diversifying selection. This method also identified zero putatively adaptive loci, leaving a final dataset for formal population genetic analysis containing all 70 originally sampled individuals, 5,239 biallelic SNPs, and an overall missing data level of 0.98 %. The number of SNPs and samples removed from the dataset at each filtering step is provided in Table S2.