2.8. Outlier association with environmental variables
A potential association between outliers and geographic/environmental variables (Major Archipelagos, Islet surface, Maximum altitude and SDI) was assessed through a permutational multivariate analysis of variance (PERMANOVA) with the function adonis2 in the ‘vegan’ R package (Oksanen et al., 2022). A Euclidean distance matrix among individuals was built on allele frequencies calculated with gl.percent.freqin the ‘dartR’ package (Gruber et al., 2018). Previously, missing genotype values were imputed using the function ‘gl.impute’ in the dartR package. Model selection was performed by first fitting a model with all two-way interactions and all main terms, and then refitting the model after excluding non-significant interactions. Significance of model terms was assessed using marginal tests (1000 permutations). SNPs that significantly discriminated between major archipelagos were identified with logistic regression models using p<0.01 after adjusting p-values to minimize false discovery rate (FDR). A genotype matrix coded as 0/1/2 was obtained using function extract.gt in the ’vcfR’ R package (Knaus and Grünwald, 2017, 2016) and used to visualize geographic clustering of individual SNP outliers by a Principal Components Analysis (PCA).
Results
We generated a total of 15 billion paired-end GBS raw reads for 100 specimens from eight distinct localities/islands (99% passing the filters, Table S2). Newly generated data and previous RADSeq data from Bassitta et al. (2021) were individually processed through the same pipeline and combined for an integrative analysis of all 16 populations. After genome mapping, variant calling and filtering, we obtained a total of 143272 biallelic SNPs for the GBS dataset (100 individuals, 8 localities), 166943 for the RADseq dataset (91 individuals, 10 localities) and 5523 for the COMBINED dataset (191 individuals, 16 localities, including only common loci between GBS and RADSeq datasets) (Table 2). As expected, multiallelic variants represented a minor fraction of the total polymorphic sites (2.1% for GBS and 2.5% for RADSeq) and were excluded from the study.
Both raw and filtered SNPs were widely distributed across the 18 autosomal and two sexual chromosomes (Z and W), with SNP frequencies being proportional to chromosome size (Figure 2; chromosome W representation strictly depends on the number of males in the dataset, see Table S1). Both sequencing methods covered the entire genome diversity, with chromosome representation being highly congruent among the three datasets (Figure 2).