3 | RESULTS AND DISCUSSION
3.1
| Genome sequencing
A single plant of Vigna radiata var. Sulv 1 was used for genome
sequencing. To achieve a high-quality Vigna radiata var. Sulv 1
genome assembly, we used a combination of sequencing methods including
Oxford Nanopore sequencing Technology (ONT), Illumina sequencing and
Hi-C mapping. A total of ~ 122.9 Gb sequencing data
(equivalent to 259.5 X genomic coverage) was generated.
3.2
| Genome assembly
For the Nanopore sequencing data, we first corrected the errors in the
sequencing reads with the help of Canu software, then genome assembly
was conducted using wtdbg2 software platform based on the corrected
data. This draft genome assembly was first calibrated through Racon with
the help of Nanopore reads by 3 rounds of calibration, and we then used
Pilon (v1.21) software to calibrate the draft genome again with the help
of Illumina HiSeq short reads in a 3 rounds of calibration process. The
resulting Nanopore genome assembly was 473.67 Mb in length, composed of
359 contigs, and the contig N50 was 11.32 Mb. The Nanopore assembly
results were summarized in table S1.
Raw Hi-C sequencing data were first filtered to remove adapter sequences
and low quality reads to obtain high quality clean data. BWA aln was
used to map the Hi-C clean data reads against the draft genome with
default parameters. The reads that can be aligned to the assembled
genome are mapped reads. There are 111.97 million unique mapped read
pairs, accounting for 60.45% of the total reads. These unique mapped
read pairs were used to identify the valid interaction pairs and the
invalid interaction pairs mapped to the draft genome by HiC-Pro. The
preliminary assembled draft genome sequence was then further assembled
using valid Hi-C data through LACHESIS, including the grouping, sorting
and orientation of the draft genome sequence, and finally the genome
sequence of Sulv 1 at the chromosome level is obtained. After Hi-C
assembly and manual adjustment, the final assembly of mungbean genome
consists of 470.45 Mb assigned into 11 individual chromosomes that
accounted for 99.32% of the genome (Figure 1), and is highly contiguous
with scaffold N50 at 42.35 Mb and contig N50 at 11.32 Mb (Table 1 and
table S2).
3.3
| Evaluation of assembly
The Nanopore assembly result was evaluated from the following three
aspects. First, to assess the assembly integrity and genome coverage, we
used bwa software to map the short sequence reads obtained from the
Illumina HiSeq sequencing platform to the reference genome. The percent
of reads mapped to the reference genome was 99.07%. Then CEGMA (v2.5,
Parra et al. 2007) software was used to assess the integrity of the
genome assembly. 449 (98.03%) of the 458 conserved core genes for
eukaryotes were present in the assembled genome. Furthermore, the
completeness of our assembled genome was assessed through BUSCO (Felipe
et al. 2015) analysis using generic model. Approximately 92.43% of the
plant orthologs were included in the assembled Sulv 1 mungbean genome
sequences (table S3). These results indicated a high accuracy and
integrity of the mungbean genome assembly.
For the Hi-C data assembled to the chromosome, the genome sequences were
cut into 100 Kb bins with equal length, and then the number of Hi-C read
pairs between any two bins is used as the intensity of the interaction
between the two bins. Within each chromosome group, the intensity of
interaction at the diagonal position is higher than the off-diagonal
position, indicating efficient chromosome assembly of Hi-C data (figure
S2).
3.4
| Protein-coding gene prediction
Protein-coding region identification and gene prediction were conducted
by a combination of ab initio , homology-based, and unigene-based
prediction methods, and aided by the software EVM for the integration of
the prediction results. We used Genscan, Augustus (v2.4), GlimmerHMM
(v3.0.4), GeneID (v1.4) and SNAP (version 2006-07-28) for ab
initio gene prediction. The homology-based prediction was conducted
with GeMoMa (v1.3.1) software. For the unigene-based prediction,
TransDecoder (v2.0) and GeneMarkS-T (v5.1) were used to predict coding
genes after reference genome based mapping using Hisat (v2.0.4) and
Stringtie (v1.2.3), and PASA (v2.0.2) was used for the prediction of
coding genes after de novo transcriptomes assembly. Finally, we used EVM
(v1.1.1) to integrate the prediction results obtained by all the above
three methods, and after modified with PASA (v2.0.2), a total of 33,924
protein-coding regions were constructed (table S4 and table S5). Among
these predicted coding genes, 20,446 were constructed by all three
methods. 6,222 genes can be predicted by both ab initio and
homology-based methods but can’t be constructed by unigene-based method.
In addition, 1,291, 5,248 and 7 coding genes were found to be specific
to ab initio , homology-based, and unigene-based prediction
methods respectively (figure S3).
3.5
| Gene function annotation
We assigned the functions of predicted protein-coding genes through
BLAST (v2.2.31) against NR, KOG, GO, KEGG and TrEMBL database, performed
KEGG pathway gene annotation analysis, KOG functional annotation
analysis and GO gene function annotation analysis (figure S4 and figure
S5). A total of 32,470 genes can be annotated, accounting for 95.71% of
all predicted genes (table S6). Among them, about 56.6% predicted genes
have GO annotations. GO enrichment analysis of gene sets was performed
in Blast2GO against Sulv 1 mungbean genome as reference. Statistical
significance was tested by Fisher’s exact test corrected in multiple
tests using Bonferroni method under false discovery rate (FDR) threshold
of 0.05.
Motifs are short conservation sequences that homologous to regions in
other sequences and perform a similar function. We annotated motifs of
the Vigna radiata genome using InterProScan
(http://www.ebi.ac.uk/Tools/pfa/iprscan/) with default parameters.
InterProScan software combines several protein motifs/domains search
tools together. It allows users to scan protein sequences at one time
against several signature databases including Prosite, PRINTS, PFAM,
ProDom, Smart, TIGRFAMs, SignlP, Trans memberane etc., and also gives GO
annotation. Analysis of protein domains by InterProScan software and
motif searching identified 2,765 motifs and 35,154 domains in theVigna radiata var. Sulv 1 genome.
3.6
| Non-coding RNA annotation
Non-coding RNA includes RNA with a variety of known functions such as
microRNA, rRNA, and tRNA. Different strategies were used to predict
non-coding RNAs according to the structural characteristics of different
kinds of non-coding RNAs. To identify microRNA and rRNA, we used blastn
to perform a genome-wide comparison based on the Rfam database, and tRNA
was identified using tRNAscan-SE software. We finally identified 86
miRNA, 352 rRNA and 653 tRNA belonging to 23, 4 and 22 families
respectively (table S7).
3.7
| Pseudogenes prediction
Pseudogenes have sequences similar to functional genes, but they have
lost their original functions due to mutations such as insertions and
deletions. We searched for possible homologous gene sequences in the
genome with the help of the predicted protein sequences through BLAT
alignment, and then GeneWise was used to search for immature stop codons
and frameshift mutations in the gene sequences to obtain pseudogenes. A
total of 4,320 pseudogenes were identified, with an average length of
2,237 bp.
3.8
| Annotation of repetitive elements
Due to the relatively low conservation of repetitive sequences between
species, it is necessary to construct a specific repetitive sequence
database when predicting repetitive sequences for Vigna radiata .
We used Repbase and a constructed de novo repeat library to annotate
repeat DNA sequences in the Vigna radiata genome. A de novo
repeat library from the assembled Vigna radiata genome was
constructed using LTR_FINDER and RepeatModeler (version open-1.0.8,
http://repeatmasker.org/RepeatModeler/) and Repbase was downloaded
from http://www.girinst.org/repbase/. The database was classified
through PASTEClassifier, and then merged with the Repbase database as
the final repeated sequence database. The repetitive elements in theVigna radiata de novo repeat library and Repbase database were
annotated by RepeatMasker. About 52.83% of the Vigna radiatagenome was annotated as repetitive sequences based on RepeatMasker
output (table S8). The length of the repetitive element type ranged from
46.4 Kb to 215.1 Mb. The most abundant repetitive element repeat type is
long terminal repeat (LTR), making up 33.92% of the genome, including
56.6% Gypsy LTRs, 39.77% Copia LTRs and 3.63% other types of LTRs.
Simple sequence repeats (SSRs) are another type of important tandem
repetitive sequences. We used MISA software to detect SSRs in the
mungbean genome. A total of 224,409 SSRs (136,045 mono-, 56,033 di-,
28,959 tri-, 1,977 tetra-, 1,098 penta-, and 297 hexa-nucleotide
repeats) were detected (table S9). The total length of the SSR sequences
was 3,252,656 bp, accounting for ~0.69% of the
assembled Sulv 1 mungbean genome.
3.9
| Phylogenetic analysis and estimation of divergence time
The assembled and annotated mungbean genome allowed us to investigate
its evolutionary history. Single-copy orthologs among taxa were used to
achieve robust phylogenetic reconstruction with high confidence and
concordance. We identified a set of single-copy orthologs from mungbean
and 10 closely related species including Vigna radiata , cowpea,
common bean, soybean, Vigna angularis , Lablab purpureus ,Medicago truncatula , Lotus japonicus , Vigna
subterranea and Arabidopsis thaliana using OrthoMCL software
(table S10). Based on this ortholog set, a phylogenetic tree of the
eleven plant species was constructed as follows: for each single-copy
gene, a coding sequence alignment was created using MUSCLE and then all
coding sequence alignments were concatenated in MEGA. The concatenated
alignment was then used to construct a maximum likelihood phylogenetic
tree using PHYML. Species divergence time was then estimated by using
the maximum likelihood tree as a starting tree through Mcmctree (Figure
2). We used a fossil calibration with a strict clock rate for the
divergence time estimation.
3.10
| Whole genome duplication in mungbean genome
To investigate the evolution of mungbean, we compared its genome with
four other eudicots: Vigna radiata , Arabidopsis thaliana(Arabidopsis), Vigna unguiculata and Phaseolus vulgaris .
The orthologs between mungbean and these species were identified using
analysis described above. We searched for genome wide duplications in
assembled mungbean genome to study mungbean genome evolution. 4DTv
(4-fold degenerate synonymous sites of the third codons) values were
calculated based on the homologous gene pairs between two species or
within the species itself. The analysis revealed whole-genome
replication events in mungbean genome. A divergence peak was observed
for Vigna radiata vs Arabidopsis thaliana , and another
lower peak was found for Vigna radiata vs common bean
(Figure
3), which suggested that the divergence of mungbean andArabidopsis thaliana occurred earlier than the divergence of
mungbean and common bean.
3.11
| Estimation of LTR insertion time
LTRs were identified in the Sulv 1 genome using LTR_FINDER software.
Mutation rates were used to estimate LTR insertion times. The results
indicated that LTR insertions are not very active in Sulv 1 (Figure 4).