Results

A highly continuous genome assembly of Chinese flowering cabbage (B. rapa var. parachinensis)

A highly inbred line of Chinese flowering cabbage (B. rapa var. parachinensis, Fig. 1) was used for genome sequencing and assembly with deep-coverage long reads and Hi-C data. The assembly pipeline for the Brassica rapa var. parachinensis genome is shown in Fig. 1. DNA samples from a single plant were prepared for PacBio, Illumina and Hi-C sequencing to avoid potential genome variability between plants. Overall, we obtained 113 Gb of PacBio and 47.5 Gb of Illumina raw reads (Table S1), corresponding to 219× and 86× depth of the estimated genome size (515 Mb), respectively. A preliminary survey of the genome size, heterozygosity, GC content and transposable element (TE) content of this inbred line was carried out with 32 Gb of clean Illumina reads (Table 1; ~83× coverage) using a k-mer-based method (Liu et al. 2013). The genome size was estimated to be about 515 Mb, with an overall GC content of 38.9% and a TE content of 64.1% (Table S1). Notably, the heterozygosity was very low (0.16%), which facilitates assembly.
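The depth figures above follow from dividing the number of sequenced bases by the estimated genome size. A minimal sketch of that arithmetic in Python, using the PacBio yield and genome size reported above; the function name `sequencing_depth` is illustrative and not part of the pipeline used in this study:

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Return average sequencing depth: sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE_BP = 515e6  # k-mer-based genome size estimate from the text
PACBIO_BASES = 113e9    # PacBio raw read yield from the text

pacbio_depth = sequencing_depth(PACBIO_BASES, GENOME_SIZE_BP)
print(f"PacBio depth: {pacbio_depth:.0f}x")
```

The same function applies to any read set once its total base count is known.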
We applied an integrated strategy to assemble the genome. First, the MECAT2 package (C.-L. Xiao et al., 2017) was used for the Chinese flowering cabbage genome assembly. Second, long reads with a length cutoff of 10 kb were polished with NGS short reads using Pilon (Walker et al., 2014). Finally, we obtained a contig assembly of 384 Mb with a contig N50 of 7.2 Mb. The assembly contained 450 contigs, the longest of which was 19.9 Mb (Table 1). The GC content of the contigs was 37.6% (Table 1). Coverage statistics computed with SAMtools suggested that the assembly is reliable (Table S2). Furthermore, of the 1,440 BUSCO genes, 97.8% were detected as complete and 0.8% as partial, supporting the completeness of the genome (Table S3).
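For readers unfamiliar with the contiguity metric quoted above, N50 is the length of the contig at which the cumulative length, summed from the longest contig downward, first reaches half of the total assembly size. A minimal sketch in Python with toy contig lengths (not data from this study); the function name `n50` is illustrative:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    result = ordered[-1]
    for length in ordered:
        running += length
        if running >= half_total:
            # First contig at which the cumulative length covers
            # half of the assembly: this length is the N50.
            result = length
            break
    return result


toy_contigs = [2, 3, 4, 5, 6]  # toy values, total length 20
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA or its index rather than typed in by hand.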
Next, high-throughput chromatin conformation capture (Hi-C) data were used to scaffold the contigs into a chromosome-level assembly. We obtained a total of 66 Gb of cleaned Hi-C paired-end (PE) reads, about 128× depth of the genome. Of these, 98.27% (434M/442M) mapped to the current assembly and ~33.18% (147M/442M) mapped to different contigs. Using contact frequencies calculated from the PE reads, 180 contigs were scaffolded into 10 pseudo-chromosomes (Fig. 1A). These 180 contigs represent 87.93% (338 Mb/384 Mb) of the total assembled sequence and 40% (180/450) of all contigs. The final assembly contains 69 scaffolds with a scaffold N50 of 32 Mb; the longest scaffold is 47.5 Mb (Table 1). The Circos map of the genome shows that each position is collinear with the other two, indicating that the annotation is complete (Fig. 1B). A large number of corrected repeat regions were identified on chromosomes A05 and A06 (Fig. 1C), suggesting that these regions may contain large stretches of DNA transposons and LTR retrotransposons.
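The anchoring rates quoted above are simple ratios. A minimal sketch in Python that recomputes them from the rounded values given in the text; because the inputs are rounded to whole megabases, the sequence-level result only approximates the reported 87.93%. The helper name `percent` is illustrative:

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Rounded values taken from the text.
anchored_mb, assembly_mb = 338, 384
anchored_contigs, total_contigs = 180, 450

anchored_seq_pct = percent(anchored_mb, assembly_mb)
anchored_contig_pct = percent(anchored_contigs, total_contigs)

print(f"Anchored sequence: {anchored_seq_pct:.1f}%")
print(f"Anchored contigs: {anchored_contig_pct:.1f}%")
```

That fewer than half of the contigs carry almost nine-tenths of the sequence indicates that the unanchored contigs are, on average, much shorter than the anchored ones.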
We also performed de novo gene prediction with the MAKER pipeline (Cantarel et al., 2008), guided by homologs from related species, short-read transcriptome data and full-length transcripts from Iso-Seq sequencing generated in the present study. We annotated 47,598 protein-coding genes in the Chinese flowering cabbage genome, with an average gene length of 2,060 bp (Table 1). Genes contain 6.13 exons on average, with a mean exon length of 199 bp (Table 1). Approximately 53.2% of the genome is annotated as repetitive sequences, which is consistent with the k-mer-based estimate. LTR retrotransposons (22.26%) and DNA transposons (17.62%) are the most abundant families (Table S4).
In conclusion, we provide what is, to our knowledge, the most contiguous and the first chromosome-level genome assembly of this species to date.