3 | RESULTS AND DISCUSSION
3.1 | Genome sequencing
A single plant of Vigna radiata var. Sulv 1 was used for genome sequencing. To achieve a high-quality Vigna radiata var. Sulv 1 genome assembly, we used a combination of sequencing methods including Oxford Nanopore sequencing Technology (ONT), Illumina sequencing and Hi-C mapping. A total of ~ 122.9 Gb sequencing data (equivalent to 259.5 X genomic coverage) was generated.
3.2 | Genome assembly
For the Nanopore sequencing data, we first corrected the errors in the sequencing reads with the help of Canu software, then genome assembly was conducted using wtdbg2 software platform based on the corrected data. This draft genome assembly was first calibrated through Racon with the help of Nanopore reads by 3 rounds of calibration, and we then used Pilon (v1.21) software to calibrate the draft genome again with the help of Illumina HiSeq short reads in a 3 rounds of calibration process. The resulting Nanopore genome assembly was 473.67 Mb in length, composed of 359 contigs, and the contig N50 was 11.32 Mb. The Nanopore assembly results were summarized in table S1.
Raw Hi-C sequencing data were first filtered to remove adapter sequences and low quality reads to obtain high quality clean data. BWA aln was used to map the Hi-C clean data reads against the draft genome with default parameters. The reads that can be aligned to the assembled genome are mapped reads. There are 111.97 million unique mapped read pairs, accounting for 60.45% of the total reads. These unique mapped read pairs were used to identify the valid interaction pairs and the invalid interaction pairs mapped to the draft genome by HiC-Pro. The preliminary assembled draft genome sequence was then further assembled using valid Hi-C data through LACHESIS, including the grouping, sorting and orientation of the draft genome sequence, and finally the genome sequence of Sulv 1 at the chromosome level is obtained. After Hi-C assembly and manual adjustment, the final assembly of mungbean genome consists of 470.45 Mb assigned into 11 individual chromosomes that accounted for 99.32% of the genome (Figure 1), and is highly contiguous with scaffold N50 at 42.35 Mb and contig N50 at 11.32 Mb (Table 1 and table S2).
3.3 | Evaluation of assembly
The Nanopore assembly result was evaluated from the following three aspects. First, to assess the assembly integrity and genome coverage, we used bwa software to map the short sequence reads obtained from the Illumina HiSeq sequencing platform to the reference genome. The percent of reads mapped to the reference genome was 99.07%. Then CEGMA (v2.5, Parra et al. 2007) software was used to assess the integrity of the genome assembly. 449 (98.03%) of the 458 conserved core genes for eukaryotes were present in the assembled genome. Furthermore, the completeness of our assembled genome was assessed through BUSCO (Felipe et al. 2015) analysis using generic model. Approximately 92.43% of the plant orthologs were included in the assembled Sulv 1 mungbean genome sequences (table S3). These results indicated a high accuracy and integrity of the mungbean genome assembly.
For the Hi-C data assembled to the chromosome, the genome sequences were cut into 100 Kb bins with equal length, and then the number of Hi-C read pairs between any two bins is used as the intensity of the interaction between the two bins. Within each chromosome group, the intensity of interaction at the diagonal position is higher than the off-diagonal position, indicating efficient chromosome assembly of Hi-C data (figure S2).
3.4 | Protein-coding gene prediction
Protein-coding region identification and gene prediction were conducted by a combination of ab initio , homology-based, and unigene-based prediction methods, and aided by the software EVM for the integration of the prediction results. We used Genscan, Augustus (v2.4), GlimmerHMM (v3.0.4), GeneID (v1.4) and SNAP (version 2006-07-28) for ab initio gene prediction. The homology-based prediction was conducted with GeMoMa (v1.3.1) software. For the unigene-based prediction, TransDecoder (v2.0) and GeneMarkS-T (v5.1) were used to predict coding genes after reference genome based mapping using Hisat (v2.0.4) and Stringtie (v1.2.3), and PASA (v2.0.2) was used for the prediction of coding genes after de novo transcriptomes assembly. Finally, we used EVM (v1.1.1) to integrate the prediction results obtained by all the above three methods, and after modified with PASA (v2.0.2), a total of 33,924 protein-coding regions were constructed (table S4 and table S5). Among these predicted coding genes, 20,446 were constructed by all three methods. 6,222 genes can be predicted by both ab initio and homology-based methods but can’t be constructed by unigene-based method. In addition, 1,291, 5,248 and 7 coding genes were found to be specific to ab initio , homology-based, and unigene-based prediction methods respectively (figure S3).
3.5 | Gene function annotation
We assigned the functions of predicted protein-coding genes through BLAST (v2.2.31) against NR, KOG, GO, KEGG and TrEMBL database, performed KEGG pathway gene annotation analysis, KOG functional annotation analysis and GO gene function annotation analysis (figure S4 and figure S5). A total of 32,470 genes can be annotated, accounting for 95.71% of all predicted genes (table S6). Among them, about 56.6% predicted genes have GO annotations. GO enrichment analysis of gene sets was performed in Blast2GO against Sulv 1 mungbean genome as reference. Statistical significance was tested by Fisher’s exact test corrected in multiple tests using Bonferroni method under false discovery rate (FDR) threshold of 0.05.
Motifs are short conservation sequences that homologous to regions in other sequences and perform a similar function. We annotated motifs of the Vigna radiata genome using InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan/) with default parameters. InterProScan software combines several protein motifs/domains search tools together. It allows users to scan protein sequences at one time against several signature databases including Prosite, PRINTS, PFAM, ProDom, Smart, TIGRFAMs, SignlP, Trans memberane etc., and also gives GO annotation. Analysis of protein domains by InterProScan software and motif searching identified 2,765 motifs and 35,154 domains in theVigna radiata var. Sulv 1 genome.
3.6 | Non-coding RNA annotation
Non-coding RNA includes RNA with a variety of known functions such as microRNA, rRNA, and tRNA. Different strategies were used to predict non-coding RNAs according to the structural characteristics of different kinds of non-coding RNAs. To identify microRNA and rRNA, we used blastn to perform a genome-wide comparison based on the Rfam database, and tRNA was identified using tRNAscan-SE software. We finally identified 86 miRNA, 352 rRNA and 653 tRNA belonging to 23, 4 and 22 families respectively (table S7).
3.7 | Pseudogenes prediction
Pseudogenes have sequences similar to functional genes, but they have lost their original functions due to mutations such as insertions and deletions. We searched for possible homologous gene sequences in the genome with the help of the predicted protein sequences through BLAT alignment, and then GeneWise was used to search for immature stop codons and frameshift mutations in the gene sequences to obtain pseudogenes. A total of 4,320 pseudogenes were identified, with an average length of 2,237 bp.
3.8 | Annotation of repetitive elements
Due to the relatively low conservation of repetitive sequences between species, it is necessary to construct a specific repetitive sequence database when predicting repetitive sequences for Vigna radiata . We used Repbase and a constructed de novo repeat library to annotate repeat DNA sequences in the Vigna radiata genome. A de novo repeat library from the assembled Vigna radiata genome was constructed using LTR_FINDER and RepeatModeler (version open-1.0.8, http://repeatmasker.org/RepeatModeler/) and Repbase was downloaded from http://www.girinst.org/repbase/. The database was classified through PASTEClassifier, and then merged with the Repbase database as the final repeated sequence database. The repetitive elements in theVigna radiata de novo repeat library and Repbase database were annotated by RepeatMasker. About 52.83% of the Vigna radiatagenome was annotated as repetitive sequences based on RepeatMasker output (table S8). The length of the repetitive element type ranged from 46.4 Kb to 215.1 Mb. The most abundant repetitive element repeat type is long terminal repeat (LTR), making up 33.92% of the genome, including 56.6% Gypsy LTRs, 39.77% Copia LTRs and 3.63% other types of LTRs.
Simple sequence repeats (SSRs) are another type of important tandem repetitive sequences. We used MISA software to detect SSRs in the mungbean genome. A total of 224,409 SSRs (136,045 mono-, 56,033 di-, 28,959 tri-, 1,977 tetra-, 1,098 penta-, and 297 hexa-nucleotide repeats) were detected (table S9). The total length of the SSR sequences was 3,252,656 bp, accounting for ~0.69% of the assembled Sulv 1 mungbean genome.
3.9 | Phylogenetic analysis and estimation of divergence time
The assembled and annotated mungbean genome allowed us to investigate its evolutionary history. Single-copy orthologs among taxa were used to achieve robust phylogenetic reconstruction with high confidence and concordance. We identified a set of single-copy orthologs from mungbean and 10 closely related species including Vigna radiata , cowpea, common bean, soybean, Vigna angularis , Lablab purpureus ,Medicago truncatula , Lotus japonicus , Vigna subterranea and Arabidopsis thaliana using OrthoMCL software (table S10). Based on this ortholog set, a phylogenetic tree of the eleven plant species was constructed as follows: for each single-copy gene, a coding sequence alignment was created using MUSCLE and then all coding sequence alignments were concatenated in MEGA. The concatenated alignment was then used to construct a maximum likelihood phylogenetic tree using PHYML. Species divergence time was then estimated by using the maximum likelihood tree as a starting tree through Mcmctree (Figure 2). We used a fossil calibration with a strict clock rate for the divergence time estimation.
3.10 | Whole genome duplication in mungbean genome
To investigate the evolution of mungbean, we compared its genome with four other eudicots: Vigna radiata , Arabidopsis thaliana(Arabidopsis), Vigna unguiculata and Phaseolus vulgaris . The orthologs between mungbean and these species were identified using analysis described above. We searched for genome wide duplications in assembled mungbean genome to study mungbean genome evolution. 4DTv (4-fold degenerate synonymous sites of the third codons) values were calculated based on the homologous gene pairs between two species or within the species itself. The analysis revealed whole-genome replication events in mungbean genome. A divergence peak was observed for Vigna radiata vs Arabidopsis thaliana , and another lower peak was found for Vigna radiata vs common bean (Figure 3), which suggested that the divergence of mungbean andArabidopsis thaliana occurred earlier than the divergence of mungbean and common bean.
3.11 | Estimation of LTR insertion time
LTRs were identified in the Sulv 1 genome using LTR_FINDER software. Mutation rates were used to estimate LTR insertion times. The results indicated that LTR insertions are not very active in Sulv 1 (Figure 4).