Marker design and statistics
To develop markers that are effective across diverse germplasm, rhAmpSeq markers were designed to target the core Vitis genome, with several features considered in the design (Figure 2). First, to decrease off-target amplification, primer binding sites must be unique and free of sequence variation. Second, the polymorphism level of the sequence amplified by the primers should be considered. Markers targeting regions with reduced polymorphism have low power to distinguish different taxa, while markers targeting regions with elevated polymorphism usually exhibit a higher mismatch rate in read mapping. To base this design on more taxa than the ten genomes discussed above, we downloaded resequencing data for 47 Vitis accessions with sequencing depth greater than threefold from public data sources, and we resequenced eight Vitis accessions. Principal component analysis revealed that the genetic background of wild species was substantially more diverse than that of cultivated lines (Supplementary figure 2). To balance wild species and cultivars in the sample, we randomly selected twenty accessions from each group. The median polymorphism of the core genome across these 40 accessions was 0.032, and a Wilcoxon-Mann-Whitney test found no significant difference in polymorphism level between the core genome and the core genes (Supplementary figure 3). To focus on regions with a moderate polymorphism level, we discarded regions outside the 25th to 75th percentile range. We also compared the genotype missing rate of SNPs in the core genome with that in the dispensable genome for these 40 accessions. As expected, the core genome regions had a significantly lower missing rate than the dispensable genome regions (a supplementary figure will be included). The last consideration in marker design was physical distribution across the genome. First, markers were randomly chosen from the core genome to obtain one marker per 200 kb.
To improve efficiency in gene mapping, we designed more markers for gene-rich regions. The candidate regions were analyzed at IDT for primer design, and 99.6% of them yielded primers with product sizes ranging from 250 to 280 bp. Of these, 98% were predicted to be multiplex-competent in one reaction. A total of 2,000 rhAmpSeq markers were designed and synthesized by IDT.
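The two selection steps described above (retaining regions with moderate polymorphism, then sampling one region per 200 kb) can be illustrated with a minimal Python sketch. This is not the pipeline used in this study. The record fields `chrom`, `start`, and `polymorphism`, both function names, and the inclusive quartile method are assumptions made for illustration only, and the additional markers placed in gene-rich regions are not modeled.

```python
import random
import statistics


def filter_moderate_polymorphism(regions):
    """Keep regions whose polymorphism lies within the 25th-75th percentile range."""
    values = [r["polymorphism"] for r in regions]
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
    # The "inclusive" method is an assumption; the text does not specify one.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [r for r in regions if q1 <= r["polymorphism"] <= q3]


def pick_one_per_bin(regions, bin_size=200_000, seed=0):
    """Randomly choose one region from each bin_size window on each chromosome."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    bins = {}
    for r in regions:
        # Regions are assigned to a window by their start coordinate.
        bins.setdefault((r["chrom"], r["start"] // bin_size), []).append(r)
    return [rng.choice(bins[key]) for key in sorted(bins)]
```

Calling `pick_one_per_bin(filter_moderate_polymorphism(candidates))` on a list of candidate region records would return at most one region per 200 kb window; windows that contain no retained region simply contribute no marker.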
Marker validation in four F1 or F2 families
To evaluate the performance of the 2,000 rhAmpSeq markers, we genotyped them in four breeding families representing the genetic diversity of US breeding practice, including wine grape, table grape, wild species, and interspecific hybrids. First, to examine amplification and sequencing bias in the rh-PCR, we calculated the average read depth for each marker (Figure 3a). After log10 transformation, sequencing depth was nearly normally distributed, and 90% of markers had a depth between 1- and 100-fold, which indicates that amplification was efficient for the majority of markers and that depth was sufficient for genotyping. Second, we checked the reproducibility of the rhAmpSeq platform in generating similar data quantities among 96-well plates of samples, which differed in DNA extraction protocol and Illumina sequencer. The average sequencing depth per sample was greater than ten for all 96-well plates (Figure 3b). Low-quality DNA of the MN family, extracted with an automated magnetic bead pipeline, returned less depth than DNA of the other families, which was processed manually with high-quality filter columns. Similarly, the HC family was sequenced using MiSeq 2,000, which has a lower output and generated less depth than the other samples, which were sequenced on HiSeq. Third, we examined the correlation of sequencing depth between two families. Excluding markers with depth less than one, the Pearson correlation coefficient (r) for the test markers was 0.78, which indicates that sequencing depth was determined mainly by the composition of the probe and the target sequence, and less so by the genetic background (Figure 3c).
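The depth summaries reported above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the analysis code used for Figure 3. The function names and the input format (one mean depth per marker, in the same marker order for both families) are hypothetical, and the text does not state whether r was computed on raw or log-transformed depths, so the correlation function uses the values as supplied.

```python
import math


def log10_depths(mean_depths):
    """Log10-transform per-marker mean depths, skipping markers with zero depth."""
    return [math.log10(d) for d in mean_depths if d > 0]


def fraction_in_depth_range(mean_depths, low=1.0, high=100.0):
    """Fraction of markers whose mean depth lies between low and high (inclusive)."""
    return sum(low <= d <= high for d in mean_depths) / len(mean_depths)


def depth_correlation(depths_a, depths_b, min_depth=1.0):
    """Pearson r between two families, excluding markers below min_depth in either."""
    pairs = [(a, b) for a, b in zip(depths_a, depths_b)
             if a >= min_depth and b >= min_depth]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Pearson r = covariance / (sd_x * sd_y); the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Excluding a marker when its depth is below the threshold in either family is one reading of "excluding markers with depth less than one"; requiring low depth in both families would be an equally plausible rule and would change which markers enter the correlation.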