Population structure
As a first analytical step, we explored population structure based on genealogies derived from the indSeq SNPs. The purpose was to develop a sense for the genetic relatedness among marine stickleback across the Atlantic Ocean, and to re-asses the relatedness of the freshwater populations among each other and to marine fish based on SNP data from whole-genome indSeq (in Haenel et al. 2019, the latter was done with SNPs derived from pooled RADseq). For computational efficiency, we reduced the full indSeq SNP panel to a random subset of 200,000 autosomal SNPs, additionally considering sample sizes of 100,000 and 15,000 SNPs in supplementary analyses (all these data sets were largely independent, as the choice of SNPs was random). For all 44 marine and freshwater individuals, we then derived haploid multilocus genotypes by drawing at each SNP the more frequent allele, or a random allele when both were equally frequent. This haploid strategy (Berner 2021) circumvented the ambiguity of diploid genotyping in individuals with low read depth. The haploid genotypes were then concatenated to nucleotide strings in fasta format.
The genotype data above were derived from SNPs chosen at random across the genome. However, both marine-freshwater and acidic-basic divergence in stickleback involves selection on numerous loci across the genome (Jones 2012b; Roesti et al. 2014; Bassham et al. 2018; Haenel et al. 2019; Terekhanova et al. 2019; Fang et al. 2020). To assess to what extent natural selection influences population structure, we additionally explored the genetic relatedness among our marine and freshwater individuals based on a subset of indSeq SNPs filtered to reduce the influence of selection. Following the strategy of Haenel et al. (2019), we excluded SNPs exhibiting an absolute allele frequency difference (AFD; Berner 2019) greater than 0.4 in both a global marine-freshwater comparison performed by pooling two random nucleotides drawn from the pileup of each individual at each SNP within the marine versus freshwater group of individuals, and in the acidic-basic comparison described below. As the latter included a MAF threshold of 0.25, we applied the same threshold in the marine-freshwater comparison. Moreover, we here considered exclusively SNPs located within the peripheral 5 Mb of each chromosome (Berner & Roesti 2017). These regions display particularly high recombination rates in stickleback (Roesti et al. 2013; Glazer et al. 2015), hence are those least affected by hitchhiking (linked selection). The 120,448 SNPs passing these filters were treated as above to obtain haploid genotype strings. We hereafter call the randomly chosen genotype data ‘Random SNPs’ and the markers chosen to reduce the footprint of selection ‘Neutral SNPs’, emphasizing that in the latter, a signal of selection may still persist.
For an earlier investigation of the genetic relatedness among North Uist stickleback based on poolSeq data, we used synthetic multilocus genotypes generated by concatenating alleles drawn from RAD sequenced sample pools (Haenel et al. 2019), thereby erasing individual-level haplotype structure. To assess the value of such synthetic genotypes for capturing genetic structure among populations, we here pooled the nucleotide counts at a number of random and neutral SNPs matching the individual-level data described above. We then drew a single nucleotide per sample location according to the observed pooled allele frequencies, and saved these draws concatenated to a single haploid nucleotide string per location in fasta format. The synthetic genotype data produced in this way allowed comparing genealogies based on truly individual-aware versus synthetic genotypes derived from the same SNP panel.
Based on the genotype files, genealogies were generated by using theape (v5; Paradis & Schliep 2018) and phangorn (v2.5.5; Schliep 2011) R packages. We determined the most appropriate models of sequence evolution, constructed maximum likelihood genealogies, and visualized them as unrooted phylograms. Node support was determined based on 500 bootstrap iterations. As an alternative to phylograms, we also considered exploring population structure by ordination (PcoA). However, the proportion of variation captured by the first ordination axes was consistently small (c. 8% or less). We therefore considered ordination an ineffective tool for pattern recognition.