Genome assembly, annotation, and characterization of repetitive sequences
We assembled the highly heterozygous (1.19 %) genome of C. rotundifolia by combining 39.38 gigabases (Gb) of PacBio Sequel sequences (~ 106 ×) and 28.31 Gb of Illumina paired-end reads (~ 77 ×) (Figure S1, Table S5). We arranged 3,289 contigs (contig N50 = 186 Kb) based on the spatial relationships deduced from 130.44 Gb of Hi-C data (~ 362 ×) (Table S6). Scaffolds with a total length of 350.69 Mb were ordered and anchored onto 12 pseudo-chromosomes, with a scaffold N50 of 27.6 Mb, covering 94.53 % of the assembled genome (Figure 1c, Figure S1, Table S7). We identified 169,723 homozygous mutation bases, representing 0.045 % of the assembled genome (one error per 2.22 Kb).
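The N50 and sequencing-depth figures quoted above follow from simple arithmetic, which the following minimal Python sketch illustrates. The function names and the toy contig lengths are hypothetical. The genome size used in the usage note is an assumption derived here from the anchored length and its reported fraction (350.69 Mb / 0.9453), not a value stated in the text, so depths computed this way may differ slightly from the reported ones.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def depth(sequenced_bases, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return sequenced_bases / genome_size


# Toy contig lengths (hypothetical), total = 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Assumed genome size in bases, inferred from 350.69 Mb being 94.53 %
# of the assembly (an assumption, not a reported value).
assumed_genome_size = 350.69e6 / 0.9453
pacbio_depth = depth(39.38e9, assumed_genome_size)
```

With the toy lengths, the two longest contigs (80 + 70 = 150) already reach half of the total, so the N50 is 70. Under the assumed genome size, the PacBio depth comes out close to the reported ~ 106 ×.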
A total of 30,824 protein-coding genes were predicted using a combination of ab initio, transcript evidence, and homology-based methods. We used the Swiss-Prot, NCBI, GO, KEGG, and eggNOG databases to annotate approximately 82.15 % of the coding genes (Table S8). Moreover, Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis suggested that 92.4 % of the genes could be recovered (Table S9). In addition, we identified 692 transfer RNAs, 128 microRNAs, 232 ribosomal RNAs (18S, 28S, 5.8S, and 5S), and 971 small nucleolar RNAs (Figure S2).
Repetitive sequences made up 47.41 % of the genome, of which 31.07 % were long terminal repeat (LTR) elements (Table S10). Divergence time estimates between the 5′ and 3′ LTRs of the same retrotransposon suggested a very recent burst of activity, less than 90.77 thousand years ago (kya), and a much more severe invasion than in grape (Figure 1d, Table S10). Further, we found 584,679 (12.90 Mb) simple sequence repeats (SSRs), with six as the most abundant unit size, fewer than in V. vinifera (PN40024; 930,680, 23.05 Mb) (Table S11).
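Dating an LTR retrotransposon insertion from the divergence of its two LTRs is commonly done with T = K / (2r), where K is the per-site distance between the aligned 5′ and 3′ LTRs and r is the substitution rate per site per year. The Python sketch below illustrates this under stated assumptions: it uses the Jukes-Cantor correction for K, which may not be the model used in this study, and the rate of 1.3e-8 is only a placeholder, not a value taken from the text. The function names are hypothetical.

```python
import math


def jukes_cantor_distance(ltr_5prime, ltr_3prime):
    """Jukes-Cantor distance between two pre-aligned LTR sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(ltr_5prime) != len(ltr_3prime):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(ltr_5prime.upper(), ltr_3prime.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(distance, rate_per_site_per_year):
    """Insertion time T = K / (2r); both LTRs accumulate substitutions."""
    return distance / (2 * rate_per_site_per_year)


# Placeholder substitution rate (assumption, not from the source).
ASSUMED_RATE = 1.3e-8

# Toy aligned LTR pair (hypothetical) with one mismatch in ten sites.
k = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
t = insertion_time_years(k, ASSUMED_RATE)
```

The factor of 2 reflects that the two LTRs are identical at insertion and then diverge independently, so the observed distance accumulates along both copies.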