Population structure
As a first analytical step, we explored population structure based on
genealogies derived from the indSeq SNPs. The purpose was to develop a
sense for the genetic relatedness among marine stickleback across the
Atlantic Ocean, and to re-asses the relatedness of the freshwater
populations among each other and to marine fish based on SNP data from
whole-genome indSeq (in Haenel et al. 2019, the latter was done with
SNPs derived from pooled RADseq). For computational efficiency, we
reduced the full indSeq SNP panel to a random subset of 200,000
autosomal SNPs, additionally considering sample sizes of 100,000 and
15,000 SNPs in supplementary analyses (all these data sets were largely
independent, as the choice of SNPs was random). For all 44 marine and
freshwater individuals, we then derived haploid multilocus genotypes by
drawing at each SNP the more frequent allele, or a random allele when
both were equally frequent. This haploid strategy (Berner 2021)
circumvented the ambiguity of diploid genotyping in individuals with low
read depth. The haploid genotypes were then concatenated to nucleotide
strings in fasta format.
The genotype data above were derived from SNPs chosen at random across
the genome. However, both marine-freshwater and acidic-basic divergence
in stickleback involves selection on numerous loci across the genome
(Jones 2012b; Roesti et al. 2014; Bassham et al. 2018; Haenel et al.
2019; Terekhanova et al. 2019; Fang et al. 2020). To assess to what
extent natural selection influences population structure, we
additionally explored the genetic relatedness among our marine and
freshwater individuals based on a subset of indSeq SNPs filtered to
reduce the influence of selection. Following the strategy of Haenel et
al. (2019), we excluded SNPs exhibiting an absolute allele frequency
difference (AFD; Berner 2019) greater than 0.4 in both a global
marine-freshwater comparison performed by pooling two random nucleotides
drawn from the pileup of each individual at each SNP within the marine
versus freshwater group of individuals, and in the acidic-basic
comparison described below. As the latter included a MAF threshold of
0.25, we applied the same threshold in the marine-freshwater comparison.
Moreover, we here considered exclusively SNPs located within the
peripheral 5 Mb of each chromosome (Berner & Roesti 2017). These
regions display particularly high recombination rates in stickleback
(Roesti et al. 2013; Glazer et al. 2015), hence are those least affected
by hitchhiking (linked selection). The 120,448 SNPs passing these
filters were treated as above to obtain haploid genotype strings. We
hereafter call the randomly chosen genotype data ‘Random SNPs’ and the
markers chosen to reduce the footprint of selection ‘Neutral SNPs’,
emphasizing that in the latter, a signal of selection may still persist.
For an earlier investigation of the genetic relatedness among North Uist
stickleback based on poolSeq data, we used synthetic multilocus
genotypes generated by concatenating alleles drawn from RAD sequenced
sample pools (Haenel et al. 2019), thereby erasing individual-level
haplotype structure. To assess the value of such synthetic genotypes for
capturing genetic structure among populations, we here pooled the
nucleotide counts at a number of random and neutral SNPs matching the
individual-level data described above. We then drew a single nucleotide
per sample location according to the observed pooled allele frequencies,
and saved these draws concatenated to a single haploid nucleotide string
per location in fasta format. The synthetic genotype data produced in
this way allowed comparing genealogies based on truly individual-aware
versus synthetic genotypes derived from the same SNP panel.
Based on the genotype files, genealogies were generated by using theape (v5; Paradis & Schliep 2018) and phangorn (v2.5.5;
Schliep 2011) R packages. We determined the most appropriate models of
sequence evolution, constructed maximum likelihood genealogies, and
visualized them as unrooted phylograms. Node support was determined
based on 500 bootstrap iterations. As an alternative to phylograms, we
also considered exploring population structure by ordination (PcoA).
However, the proportion of variation captured by the first ordination
axes was consistently small (c. 8% or less). We therefore considered
ordination an ineffective tool for pattern recognition.