Materials and methods

Global sample collection

Due in part to its extensive distribution and phenotypic diversity, Plantago major has a complex taxonomic history, with over 50 lower taxonomic divisions and synonyms having been proposed (Pilger, 1937; Mølgaard, 1976; Peruzzi & Passalacqua, 2003; POWO, 2019). At least three subspecies are recognized in Europe, two of which, P. major subsp. major and P. major subsp. intermedia (Gilib.) Lange, have been widely studied and even recognized as separate species (P. major and P. intermedia DC.; Morgan-Richards & Wolff, 1999; POWO, 2019). The full extent of the geographic range occupied by each subspecies, and the overlap between them, is poorly known, particularly outside of Northern Europe. However, we focused our sampling on P. major subsp. major, which is thought to be more widely distributed and tolerant of a wider range of environmental variation and anthropogenic disturbance (Mølgaard, 1976). Plantago major subsp. major can be discriminated from P. major subsp. intermedia based on the number of seeds per capsule (Mølgaard, 1976; van Dijk, 1984; Wolff, 1991). Only specimens fitting these characters were selected for our study.
Fifty populations of naturally occurring Plantago major subsp. major plants were sampled worldwide, including 22 populations from across the native range (Europe, Asia) and 28 populations from the introduced range (North and South America, Iceland, Greenland, Hawaiʻi, Australia, New Zealand, and Africa) (Figure 2; Table 1). Six to ten individuals were sampled from each population, amounting to 385 individuals in total. A herbarium voucher was collected from each population to confirm species identity and deposited in Herbarium C at the Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark. Herbarium vouchers could not be obtained for Peru; however, digital photographs were taken in lieu of vouchers.

DNA extractions

Leaf tissue was pulverized with ceramic beads in a Qiagen TissueLyser® (Qiagen, Germany) and DNA extractions were prepared using a modified Qiagen DNeasy® Plant Mini Kit (Qiagen, Germany) or a CTAB protocol (conducted by ADNid, Montpellier, France). DNeasy Plant Mini Kit extractions were modified by adding 50 μL of proteinase K to the cell lysis solution after a 10 min incubation at 65 °C, followed by 2 hours of incubation at 45 °C. Double extractions were made for each sample and then pooled, to ensure that the quantity and quality of genomic DNA were sufficient for genotyping-by-sequencing (GBS).

Sequencing

In total, DNA extracts from 385 individuals were sequenced (Table 1). GBS library preparation was performed at the Genomic Diversity Facility at Cornell University, following Elshire et al. (2011). DNA from each sample was digested using the restriction enzyme PstI (CTGCA^G), and both a sample-specific barcoded adapter and a common adapter were ligated to the sticky ends of the fragments. Samples were pooled and fragment libraries cleaned using a QIAquick PCR purification kit (Qiagen, USA). Libraries were amplified using an 18-cycle PCR with long primers complementary to the barcoded and common adapters, purified again using QIAquick, and quantified using a PicoGreen assay (Molecular Probes, Carlsbad, CA, USA). Samples were distributed across seven 96-well plates, with one blank included per plate. Each plate was run on a lane of an Illumina HiSeq 2000 (100-basepair, single-end) at the Cornell Core Laboratories Center (Cornell University, New York, USA).
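The digestion step above can be illustrated with a short sketch: PstI recognizes CTGCAG and cuts the top strand between CTGCA and G, producing the sticky-ended fragments to which adapters are ligated. The sequence below is made up for illustration; this is not part of the library-preparation software.

```python
# Illustrative sketch of PstI digestion (CTGCA^G): locate recognition
# sites in a made-up sequence and split at the top-strand cut position.

SITE = "CTGCAG"
CUT_OFFSET = 5  # PstI cuts the top strand between ...CTGCA and G...

def pst1_fragments(seq):
    """Return the top-strand fragments left after PstI digestion."""
    fragments, start, pos = [], 0, seq.find(SITE)
    while pos != -1:
        fragments.append(seq[start:pos + CUT_OFFSET])
        start = pos + CUT_OFFSET
        pos = seq.find(SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments

print(pst1_fragments("AAACTGCAGTTTTCTGCAGGG"))
# -> ['AAACTGCA', 'GTTTTCTGCA', 'GGG']
```

Each internal fragment here begins and ends at a cut site, mirroring the fragments that receive a barcoded adapter on one end and the common adapter on the other.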

Assembly of pseudo-reference genome (catalogue)

stacks, a software pipeline designed for restriction-enzyme-based data from organisms without a reference genome, was used to generate a catalogue, or pseudo-reference genome. Sequencing reads for 392 samples (385 samples plus one blank from each of the seven plates) were demultiplexed using the process_radtags function in stacks v.1.45 (Catchen et al., 2013). Demultiplexing was run iteratively, with the reads discarded at each step fed as input to the next process_radtags step with a smaller barcode size, to retain the maximum number of reads per sample. During demultiplexing, we rescued barcodes (using the -r option) and retained reads that did not match any barcode of the given length in a separate file. The --adapter_1 parameter was used to identify the common adapter sequence, the maximum adapter mismatch was set to 2, and a perfect match was required on the barcodes.
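The iterative demultiplexing strategy can be sketched as follows. This is an illustrative re-implementation of the idea, not process_radtags itself: reads are matched against the longest barcodes first, and unmatched reads are carried over to the next pass with shorter barcodes. Barcodes and reads below are hypothetical.

```python
# Sketch of iterative demultiplexing (illustrative only, not stacks):
# each pass matches reads against barcodes of one length; leftovers
# feed the next pass with shorter barcodes, maximizing retained reads.

def demultiplex_pass(reads, barcodes):
    """Assign reads whose prefix exactly matches a barcode; return leftovers."""
    assigned, unmatched = {bc: [] for bc in barcodes}, []
    for read in reads:
        for bc in barcodes:
            if read.startswith(bc):
                assigned[bc].append(read[len(bc):])  # trim the barcode off
                break
        else:
            unmatched.append(read)
    return assigned, unmatched

def iterative_demultiplex(reads, barcode_sets):
    """Run passes from longest to shortest barcodes, as described in the text."""
    per_sample = {}
    for barcodes in sorted(barcode_sets, key=lambda s: -len(s[0])):
        assigned, reads = demultiplex_pass(reads, barcodes)
        per_sample.update(assigned)
    return per_sample, reads  # leftover reads matched no barcode

per_sample, leftover = iterative_demultiplex(
    ["ACGTAAA", "TGCCC", "GGGG"], [["ACGT"], ["TG"]]
)
print(per_sample, leftover)
```

The real pipeline additionally rescues barcodes with sequencing errors (-r) and screens for adapter contamination; this sketch shows only the iterative pass structure.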
We selected the 20 samples with the highest number of reads, while maximizing the number of sampling locations spanned, to build a reference catalogue using cstacks in stacks v.1.45 (see Table 1). These samples were processed using ustacks with the deleveraging algorithm enabled (-d option), while retaining unused reads (-r option) and requiring a minimum of three reads to build a stack (-m 3). The resulting output was used to run cstacks, with four mismatches allowed between different loci (-n 4) when building the catalogue. We also allowed gapped alignments between loci, with a maximum gap length of four (--gapped, --max_gaps 4). This results in a minimum Hamming distance of 4 between any pair of loci in the reference catalogue. From the loci identified in the reference catalogue, we created a pseudo-reference genome by collapsing these loci into 10 chromosomes, with a string of 150 Ns inserted between consecutive loci.
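The final collapsing step can be sketched in a few lines. Note the assumptions: how loci are partitioned among the 10 pseudo-chromosomes is not specified in the text, so the round-robin assignment below is illustrative, and the loci are placeholders rather than real catalogue sequences.

```python
# Minimal sketch (not the stacks output itself) of collapsing catalogue
# loci into a pseudo-reference genome: loci are distributed over 10
# pseudo-chromosomes with a 150-N spacer between consecutive loci.
# The round-robin partitioning is an assumption for illustration.

N_CHROM = 10
SPACER = "N" * 150

def build_pseudo_reference(loci):
    """Assign loci round-robin to pseudo-chromosomes, joined by N spacers."""
    chroms = [[] for _ in range(N_CHROM)]
    for i, locus in enumerate(loci):
        chroms[i % N_CHROM].append(locus)
    return {f"chr{c + 1}": SPACER.join(seqs)
            for c, seqs in enumerate(chroms) if seqs}

loci = ["ACGT" * 4] * 25  # placeholder catalogue loci
ref = build_pseudo_reference(loci)
```

The long N spacer keeps read alignments from spanning two adjacent loci, so each catalogue locus behaves as an isolated reference region during mapping.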

Mapping to the pseudo-reference genome (catalogue)

Raw reads were demultiplexed using the demultiplexer function in gbsx v.1.3 (Herten et al., 2015), allowing for one mismatch in the enzyme site and one mismatch in the barcode (-me 1 and -mb 1). Demultiplexing with gbsx retained a larger fraction of reads than stacks, which in turn allowed us to obtain a larger set of variants for downstream analyses. Therefore, the demultiplexed reads obtained from gbsx were used for mapping to the catalogue assembled in cstacks. Mapping was performed using the paleomix pipeline, run with default parameters (Schubert et al., 2014). As part of the paleomix pipeline, reads were first filtered for adapter sequences using adapterremoval2 v. 2.2.0 (Schubert et al., 2016), and the trimmed sequences were mapped against the pseudo-reference genome using the backtrack algorithm in bwa v. 0.7.15 (Li & Durbin, 2009). The blanks from each of the seven plates were also included to verify that they did not map to the pseudo-reference genome; blanks were then excluded from downstream analyses.

SNP-calling and genotype likelihoods

The alignments generated by bwa were used to identify SNPs in our samples. Because GBS data inherently show very high variance in locus coverage and in depth of coverage among and within loci, there is high uncertainty associated with calling genotypes at variant sites. To propagate this uncertainty to downstream analyses, we used genotype likelihood-based methods, which account for this uncertainty directly (Nielsen et al., 2013). SNP locations and genotype likelihoods at those locations were computed using angsd v. 0.921 (Korneliussen et al., 2014). A total of 385 samples were used to identify variant positions (blanks were excluded from all further analyses). As a pre-processing step, we assessed the distribution of depth of coverage in a randomly chosen subset of 100 samples to set a cut-off for the maximum number of reads covering a single position. We discarded reads that mapped to multiple positions, as well as low-quality bases and reads (quality score less than 20). Using the distribution of depths obtained from angsd, we chose a maximum average depth cut-off of 70× coverage per sample. Variant discovery and computation of genotype likelihoods were performed using angsd with the same parameters as above, but additionally with a depth cut-off of 70 per sample (-maxDepth NumberOfSamples*70), a requirement that at least 50% of the samples have at least one read covering the site (-minInd NumberOfSamples*0.5), a minimum SNP p-value of 10⁻⁶ (-SNP_pval 1e-6), and a penalty for poorly aligning reads with many mismatches to the reference locus (-C 50). Genotype likelihoods were calculated using the model described in samtools (-GL 1; v. 1.4; Li, 2011).
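The two site-level filters described above can be sketched as follows. This is an illustrative re-implementation of the filtering logic, not angsd itself, and the depth values are hypothetical: a site is kept only if its total depth does not exceed 70 reads per sample on average, and at least half of the samples have at least one read.

```python
# Illustrative re-implementation of the angsd site filters in the text
# (not angsd itself): total depth capped at 70 x n_samples, and at
# least 50% of samples covered by >= 1 read. Depths are hypothetical.

MAX_MEAN_DEPTH = 70   # per-sample average depth cut-off (-maxDepth n*70)
MIN_IND_FRAC = 0.5    # fraction of samples with >= 1 read (-minInd n*0.5)

def keep_site(depths):
    """depths: per-sample read depth at one candidate site."""
    n = len(depths)
    total = sum(depths)
    covered = sum(1 for d in depths if d > 0)
    return total <= MAX_MEAN_DEPTH * n and covered >= MIN_IND_FRAC * n

print(keep_site([10, 0, 5, 8]))   # 3 of 4 samples covered, total 23 <= 280
print(keep_site([0, 0, 0, 400]))  # only 1 of 4 covered -> rejected
```

The upper depth bound guards against collapsed repeats and PCR artifacts, while the coverage floor keeps sites with enough data across samples to estimate allele frequencies.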

Phylogenomic inference

Pairwise genetic distances between individuals were estimated using ngsdist (Vieira et al., 2015), a distance-based method that takes genotype likelihoods into account and uses the genotype posterior probabilities estimated in angsd. Pairwise genetic distances among the global samples were computed using SNPs with a minor allele frequency greater than 10% in the sample, resulting in 2807 total sites. In addition, for each pair of samples, all sites lacking read coverage in either sample were discarded. Pairwise distances were visualized as multidimensional scaling (MDS) plots, using the cmdscale function in base R (v. 3.6.2) to compute the first six coordinates.
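The idea of a genotype-posterior-based distance can be sketched as follows. This is written in the spirit of ngsdist rather than as its exact implementation: at each site, the expected allelic difference between two individuals is averaged over their genotype posteriors, with genotypes coded as 0, 1, or 2 copies of one allele. The posterior values below are hypothetical.

```python
# Sketch of a pairwise distance from genotype posteriors (illustrative,
# not the exact ngsdist formula): the per-site expected fraction of
# differing alleles is averaged over both individuals' posteriors.

def site_distance(post_a, post_b):
    """Expected fraction of differing alleles at one site.

    post_a, post_b: posterior probabilities for genotypes (0, 1, 2).
    """
    return sum(
        pa * pb * abs(ga - gb) / 2.0
        for ga, pa in enumerate(post_a)
        for gb, pb in enumerate(post_b)
    )

def pairwise_distance(posts_a, posts_b):
    """Average the per-site expected differences across shared sites."""
    sites = list(zip(posts_a, posts_b))
    return sum(site_distance(a, b) for a, b in sites) / len(sites)

# Certain homozygotes for opposite alleles differ at both alleles:
print(site_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 1.0
```

Because uncertain genotypes contribute fractionally rather than as hard calls, low-coverage sites inflate the distance far less than they would under best-genotype calling.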

Admixture analysis

To identify population structure and patterns of admixture among our samples, an individual-based assignment test was performed using ngsadmix (Skotte et al., 2013; Fumagalli et al., 2014) with 7594 SNPs. ngsadmix is a maximum likelihood method that uses the genotype likelihood data obtained from angsd. Our analyses were run with the number of ancestral populations, K, set from 2 to 12. For K = 2 to K = 8, analyses were replicated 200 times, and for K = 9 to K = 12 the analyses were replicated 500 times to ensure convergence to the global maximum. For each K, the replicate with the highest log-likelihood was chosen. A test statistic for determining the optimal K value is not available in ngsadmix. Therefore, results for each value of K were reviewed and compared with results from the MDS plots to select the K value with the most biological relevance given the genomics of the species being studied.
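The replicate-selection step amounts to a simple maximization per K. A minimal sketch, with made-up log-likelihood values (this is post-processing of ngsadmix output, not part of ngsadmix itself):

```python
# Minimal sketch of replicate selection: for each K, keep the ngsadmix
# replicate with the highest log-likelihood. Values are made up.

def best_replicates(runs):
    """runs: dict mapping K -> list of (log_likelihood, result) tuples."""
    return {k: max(reps, key=lambda r: r[0]) for k, reps in runs.items()}

runs = {
    2: [(-1500.2, "rep1"), (-1498.7, "rep2")],
    3: [(-1402.9, "rep1"), (-1405.3, "rep2")],
}
print(best_replicates(runs))  # {2: (-1498.7, 'rep2'), 3: (-1402.9, 'rep1')}
```

Running hundreds of replicates and keeping the best guards against the optimizer stalling in a local maximum, which becomes more likely at higher K.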

Inference of population splits and migration events

treemix v. 1.13 (Pickrell & Pritchard, 2012) was used to assess how well the relationships among populations can be represented by a bifurcating tree based on population allele frequencies. A custom Python script was written to convert the Beagle files generated in angsd to treemix input files (see Supplementary Materials). In addition to admixture, treemix also provides some information about the directionality of gene flow (Patterson et al., 2012). Trees were rooted (-root) using populations from Yukon, Japan, and South Korea (based on output from the admixture analyses suggesting that these populations are outgroups to the others), and we estimated the likelihood of graphs with 0–4 migration events added, in order to visualize the proportion and direction of gene flow events between the 50 sampled populations.
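The in-house conversion script itself is in the Supplementary Materials; as a hedged illustration of what such a conversion involves, the sketch below turns per-population allele frequencies into treemix's count format (a header of population names, then one "refCount,altCount" pair per population per SNP). The population names, frequencies, and sample sizes here are hypothetical, and this is not the authors' script.

```python
# Hedged sketch of a frequencies-to-treemix conversion (illustrative,
# not the in-house script): treemix expects a header of population
# names, then per SNP one "ref,alt" allele-count pair per population.

def to_treemix(pop_names, freqs_per_snp, n_haplotypes):
    """freqs_per_snp: list of {pop: alt allele frequency} dicts."""
    lines = [" ".join(pop_names)]
    for freqs in freqs_per_snp:
        counts = []
        for pop in pop_names:
            alt = round(freqs[pop] * n_haplotypes[pop])  # frequency -> count
            ref = n_haplotypes[pop] - alt
            counts.append(f"{ref},{alt}")
        lines.append(" ".join(counts))
    return "\n".join(lines)

pops = ["Yukon", "Japan"]                      # hypothetical populations
snps = [{"Yukon": 0.25, "Japan": 0.9}]         # hypothetical frequencies
sizes = {"Yukon": 20, "Japan": 20}             # haploid sample sizes
print(to_treemix(pops, snps, sizes))
```

Rounding frequency times haplotype count back to integer counts is the lossy step; a production script would instead propagate the allele-frequency estimates obtained directly from the genotype likelihoods.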