Syntenic core-genome construction
The repetitive regions in both the reference genome and the assemblies were filtered using a kmer frequency-based approach with BBduk, a tool implemented in the BBTools package (version 35.50)
\cite{bushnell2014bbmap}. The sequences with a Kmer (K=31) frequency larger than two were replaced with N. The masked assembly was aligned to the reference genome, PN40024 (version 12X.v2)
\cite{Canaguier2017}, using Minimap2 with presets of parameters tuned for cross-species alignment, denoted as “asm5” in the manual
\cite{Li_2018}. The result is transformed to a bed-like format for chaining and identifying one-to-one matches using quota_alignment(
https://github.com/tanghaibao/quota-alignment)
\cite{Tang2011}. The coverage of the core genome was defined as the number of genomes for which a genomic region was covered by the syntenic one-to-one genome alignment. Two sets of core genome were defined: one set considered all nine genomes, and the other set excluded three wild Vitis species. The core genome coverage, gene density, transposable element (TE) density, and other genome features were calculated with window size of 1 Mb using BEDTools (v2.27.1). The correlation between each pair of genome features was calculated using Spearman’s rank correlation coefficient in R software(Version 3.5.0)
\cite{nokey_07eee}.
Genus-wide variant calling
A total of 47 Vitis accessions with 2- to 93-fold paired-end Illumina sequencing data were downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). If one accession had more than 20-fold read coverage, we randomly down-sampled to 20-fold read coverage for variant calling. We also sequenced 8 accessions with 5- to 10-fold paired-end sequencing using the Illumina HiSeq 2500 platform, with an insert size of 400bp and the read length of 2 × 150bp. The raw sequencing data have been deposited in the SRA at NCBI under BioProject ID: PRJNAXXXXXX. All the information of SRA, including project number, total base pairs, and name of accession are list in supplemental table 2. Variants were called using the Sentieon DNA Software Package (version, Golden helix )\cite{Kendig396325} with default settings. This Sentieon package is a speed-up software that rebuilt the Genome Analysis Toolkit HaplotypeCaller and returns the same result as GATK 3.3. Principal component analyses (PCA) was conducted using R/Bioconductor Package SNPRelate\cite{Zheng2012}. To avoid the strong influence of the SNP clusters in the PCA analysis, SNPs were filtered using LD-based pruning algorithm implemented in SNPRelate with LD threshold 0.2. We arbitrarily select one if the Euclidean distance between a pair of samples is less then 0.05 and kept 20 V. vinifera samples and 20 non-vinifera samples.
Marker design pipeline
The Variant Calling Format (VCF) file generated from the genus-wide variants calling was loaded into R software. For each aligned region in the core-genome, the length, diversity and missing rate was calculated. The regions that were shorter than 200bp, with diversity larger than 7% or smaller than 2%, or with average missing rate large than 50% were removed. These steps were conducted in R using the bioconductor (version 3.8) package VariantAnnotation\cite{Obenchain2014}. The candidate regions were then picked to ensure one marker per 200 kb. If no qualified candidate region could be found in a 1 Mbp window, we included the regions that had highest coverage in the core genome construction. To achieve a better representation of gene rich regions, we arbitrarily included more candidate regions with high gene density.A total of 2500 candidate regions were sent to Integrated DNA Technologies Inc. (IDT, Coralville, IA, USA) for primer design and pooling compatibility test. Primers could be designed for 99.6% of the regions, and 98.1% of them were pooling compatible in one-tube-PCR. A total of 2,000 rhAmpSeq markers were synthesized by IDT.