Material and methods

Study populations and sampling procedure

A total of 234 pike from 11 populations (N = 8 – 27) were sampled for this study. The locations of the populations ranged from 63.6 – 54.9 °N, and represented all three spawning ecotypes (freshwater, anadromous, and resident; for details see Figure 1and Table 1 ). Our initial plan was to include both anadromous and resident populations throughout the latitudinal range. However, due to the aforementioned decreases of pike in the Baltic Sea, many of the resident populations are no longer found in their former spawning locations, and we were therefore not able to include more resident populations. Individuals were captured using fyke nets, rod-and-reel fishing, or electrofishing, and a non-lethal DNA sample (fin clip) was collected for each individual before they were released back into the water. The fin clips were placed in separate 1.5 mL Eppendorf tubes with 70 % ethanol, which were stored in a freezer (-20 °C) until the molecular work was conducted.

Molecular workflow

DNA was extracted from fin tissue with DNeasy blood and tissue kit (Qiagen, USA), and digested with HF EcoRI (New England Biolabs, USA). Size-selection, library preparation, sequencing (using either Illumina NovaSeq 6000 2 x 150 bp or Illumina HiSeq 2500 2 x 125 bp, Table 1 ), demultiplexing and quality control with MultiQC (Ewels, Magnusson, Lundin, & Käller, 2016) were performed by Science for Life Laboratory (Stockholm, Sweden). Sequence data for Harfjärden, Okne and Lervik was previously published (Sunde et al., 2020a) and retrieved from the NCBI Sequence Read Archive (Sunde, Yildirim, Tibblin, & Forsman, 2020b). In total this yielded a dataset of ~ 3,200 M raw reads (mean 13.6 M per sample). Quality filtering of the raw reads was conducted using Trimmomatic (Bolger, Lohse, & Usadel, 2014) and process_radtags in the Stacks pipeline (Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011; Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). The ~ 3,000 M reads (95 %) that passed the quality filtering, were further processed with the integrated approach (Paris, Stevens, & Catchen, 2017) of the Stacks de novo pipeline, using parameter settings determined by an initial parameter optimization (Figure S1-S2 ). This yielded a final dataset of 5993 biallelic SNPs, for which missing data was imputed using Beagle (Browning, Zhou, & Browning, 2018) before it was used in the downstream analyses. For details on the molecular workflow see section ‘Supplementary methods’ in the Supplementary materials.

Investigations of neutral genetic variation

Analysis of neutral genetic diversity and population structure

The Populations software in the Stacks pipeline was used to obtain estimates of number of private alleles, observed heterozygosity (HO ), expected heterozygosity (HE ), and fixation index (Fis ) for each population. The same software was used to obtain estimates of the fixation index (FST ) by Weir and Cockerham (1984), that was used to assess pairwise population differentiation.
To assess genetic structuring, the full dataset was analysed using multiple approaches. To determine the most likely number of genetic clusters (K ), fastSTRUCTURE (Raj, Stephens, & Pritchard, 2014) was used. Because fine-scaled differentiation might be concealed by differentiation on higher levels (Evanno, Regnaut, & Goudet, 2005), we also ran the analysis with a subset of only the Baltic Sea populations (anadromous and resident) to test if further differentiation became evident. A Principal Component Analysis (PCA) based on pairwise genetic similarity between individuals (proportion of alleles shared) was visualized with the prcomp function in RStudio 2 (RStudio Team, 2015) with R (R Core Team, 2012) . For details about settings see ‘Supplementary methods’ in the supplementary materials.
To determine whether the full dataset was representative of neutrally evolving diversity, the results were compared to those generated by a subset ’neutral’ dataset. This ’neutral’ dataset was created by excluding the 5 % tails of the FST distribution from the full dataset, which should exclude loci under strong selection (both divergent and balancing). Patterns of genetic structure indicated by such a neutral dataset should therefore be reflective of only stochastic processes. The comparison revealed that the results obtained for the full and the ’neutral’ datasets were qualitatively similar (not shown), indicating that the full dataset was representative of neutral diversity. We therefore chose to proceed with the full dataset for the comparisons with adaptive genetic variation and structure (see below).

Isolation by distance

The full dataset for only the Baltic Sea samples (anadromous and resident populations) was used to test for the presence of a correlation between geographical and genetic distances (FST ) between the populations. The freshwater populations were excluded because we did not expect them to exhibit an IBD pattern because of geographical isolation among the lakes. We performed a Mantel test (Mantel, 1967) in Arlequin (Fdist; Excoffier & Lischer, 2010), the significance was tested with 1,000 randomization, and the correlation was visualized in ggplot2 (Wickham, 2016) in R. An additional Mantel test was performed for only the anadromous populations to test for potential correlation between geographical and genetic distances within this ecotype.

Testing for effects of environmental variables

To investigate the roles of salinity (ecotype) and/or latitude for neutral genetic structure, we used the similarity matrix (for the full dataset) created for the PCA in a distance-based redundancy analysis (db-RDA), which is a constrained version of a PCA (db-RDA; Legendre & Anderson, 1999). Because potential adaptations to salinity either could have resulted from differences in the spawning habitats, or reflect differences in salinity in the foraging areas, we wanted to account also for the salinity gradient in the Baltic Sea in the analysis. To this end, we chose to use midrange salinity (mean value of minimum and maximum salinity experienced during the life-cycle) as a proxy for ecotype (salinity for each population was obtained from Bendtsen et al. (2007); see Table 1 ). This resulted in that all freshwater populations were assigned a value of 0, the resident population the highest value of 12, and the anadromous values in the intermediate range of 1.5 to 6.0.
To make the variables comparable for the db-RDA, both environmental variables (latitude and midrange salinity) were normalized (to a standard deviation of 1; by first subtracting the normal value from the observed value, and then dividing by the standard deviation of all observed values) before running the db-RDA. For the normalization we used 0 (psu) as the normal value for salinity (because pike is of freshwater origin; Craig, 1996; Raat, 1988), and the mean latitude as normal value for the latitude. The db-RDA was run in the vegan package (Oksanen et al., 2019) in R, and the statistical significance were assessed with 999 permutations.

Phylogenetic analysis

The evolutionary relationship among the samples was investigated with a maximum likelihood (ML) based phylogenetic inference in RAxML v8.2.10 (Stamatakis, 2014). This was done by using the full dataset following the steps described in ’The Hard & Slow Way’ in the RAxML manual (Stamatakis, 2016), which includes optimizing the parameter settings for number of rate categories and initial rearrangement. After the optimization, the final analysis with 200 inferences and 1,000 bootstraps was run using 10 as initial rearrangement, 25 rate categories, the GTRCAT model of nucleotide substitution, and using the E. lucius genome published by Rondeau et al. (2014) as outgroup. The phylogenetic tree was then visualized using Interactive Tree Of Life (iTOL)(Letunic & Bork, 2019).

Investigations of adaptive genetic variation

Identification of loci putatively under selection

The full dataset was used in the outlier analyses to search for loci putatively under selection. To test for locus-specific effects, populations were introduced as separate groups, and the data was analysed using multiple approaches, in three different software (BayeScan, Fdist, and LOSITAN; for details about settings see Supplementary methods in the supplementary materials). Only loci identified as outliers by all three software were retained. This was done because differences in the algorithms and assumptions of the approaches might lead to different SNPs being identified as outliers; and by conservative selection of only overlapping ones, it is more likely that the identified outliers are true positives (de Villemereuil, Frichot, Bazin, François, & Gaggiotti, 2014).
To test for correlations between selection and ecotype, we utilized BayeScEnv (de Villemereuil & Gaggiotti, 2015), which aims at differentiating signals of selection from those of demographic processes by searching for associations between co-variates (environmental variables) and allele frequencies. A benefit with BayeScEnv when searching for outlier loci associated with co-variates is that the software can simultaneously evaluate the roles of two environmental variables in a single analysis. Thus, using this software enabled us to search for outlier loci associated with midrange salinity after statistically accounting for variation due to latitudinal differences. For this, we used the same normalized values for the two environmental variables (midrange salinity and latitude) as in the db-RDA (for details see subsection ’Testing for effects of environmental variables’ above).
Analysis of adaptive genetic diversity and structure, and testing for effects of environmental variables
To compare the effects of neutral and adaptive processes on patterns of genetic diversity, a subset dataset that should represent adaptive variation (henceforth referred to as the ‘adaptive dataset’) was also analyzed. This adaptive dataset was created by selecting the 17 loci that were identified as outliers by all three software used for the outlier analyses with population grouping, and it should therefore be reflective of deterministic processes such as selection. To obtain estimates of HO , and HE , the populations software in Stacks was used. The adaptive dataset was also analyzed using the same fastSTRUCTURE and db-RDA procedures as used for the full dataset. For details on procedures see ‘Supplementary methods´ in the supplementary materials.