2 | MATERIALS AND METHODS
2.1
| Plant materials
The pure line of an elite cultivar, Sulv 1 was chosen for genome
sequencing. The seedings were grown in a greenhouse under 25 °C and 16 h
photoperiod
condition.
The fresh leaves were collected from one-month old plants, the samples
were rapidly frozen in liquid nitrogen and stored at -70°C until use.
2.2
| Genome size estimation
The Sulv 1 genome size was calculated with the following formula: genome
size = total k-mer number/average k-mer depth, and total k-mer number is
the total number of k-mers from all reads. 350 bp insert size clean
reads were used to perform the k-mer (k = 19) analysis. A total of
54,027,628,001 k-mers were counted from these clean reads. A k-mer depth
distribution was obtained from paired end reads, and the peak depth was
clearly observed from the distribution data. Based on this distribution,
the size of the Sulv 1 genome was estimated to be ~
539.8 Mb and the heterozygosity was estimated to be 0.03% (Figure S1).
2.3
| Genome sequencing and assembly
Total DNA was isolated by using the CTAB method to construct Nanopore
and Illumina libraries. Libraries were generated and sequenced on the
PromethION sequencer platform (Oxford Nanopore Technologies, UK) at the
Biomarker Technologies Corporation (Beijing, China).
We first corrected the errors in the Nanopore sequencing reads with the
help of Canu (Koren et al. 2017) software. Based on this corrected data,
Sulv 1 genome was assembled using wtdbg2
(https://github.com/ruanjue/wtdbg2) software platform.Wtdbg2 chops reads
into 1024 bp segments, merges similar segments into a vertex and
connects vertices based on the segment adjacency on reads. The resulting
graph is called fuzzy Bruijn graph (FBG) which is similar to De Bruijn
graph but permits mismatches/gaps and keeps read paths when collapsing
k-mers. The draft genome was first calibrated using Racon (Vaser et al.
2017) with Nanopore reads through 3 rounds of calibration, and Pilon
(v1.21, Bruce et al. 2014) was then used to calibrate the draft genome
again with the help of short Illumina HiSeq reads in a 3 rounds of
calibration process too.
2.4
| Hi-C sequencing and chromosomes anchoring
We constructed Hi-C fragmented libraries (300-700 bp insert size) as
illustrated in Rao et al (Rao et al. 2014) and libraries were sequenced
through Illumina sequencing platform. Briefly, adapter sequences of raw
Hi-C reads were trimmed and low quality paired end reads were removed in
order to obtain clean data. The clean Hi-C reads were first truncated at
the putative Hi-C junctions and then the resulting trimmed reads were
aligned to the draft assembly with the help of bwa aligner (Li et al.
2013). Only uniquely aligned read pairs whose mapping quality was
greater than 20 were retained for further analysis. Invalid read pairs,
including Dangling-End and Self-cycle, Re-ligation and Dumped products,
were filtered by HiC-Pro (v2.8.1, Servant et al. 2015).
The uniquely mapped read pairs were valid interaction pairs and were
used for the correction of scaffolds and to order and orientate
scaffolds onto chromosomes by LACHESIS (Burton et al. 2013).
Before chromosomes assembly, we first performed a pre-assembly for the
error correction of scaffolds which required the splitting of scaffolds
into segments of 50 kb on average. The Hi-C data were mapped to these
segments using BWA (version 0.7.10-r789, Li et al. 2009) software. The
uniquely mapped data were retained to perform assembly by using LACHESIS
software. Any two segments which showed inconsistent connection with
information from the raw scaffolds were checked manually. These
corrected scaffolds were then assembled with LACHESIS. Parameters for
running LACHESIS software included: CLUSTER_MIN_RE_SITES=186,
CLUSTER_MAX_LINK_DENSITY=2, CLUSTER_NONINFORMATIVE_RATIO=1.3,
ORDER_MIN_N_RES_IN_TRUN=98, ORDER_MIN_N_RES_IN_SHREDS=100.
After this step, placement and orientation errors exhibiting obvious
discrete chromatin interaction patterns were manually adjusted. Finally,
470.45 Mb of the sequences (representing 99.3% total length) were
anchored to 11 chromosomes.
2.5
| Gene and repetitive sequence annotation
Repeats were masked on the assembled Sulv 1 genome using de novo-based
and homology-based strategies. We used RepeatMasker (Tarailo-Graovac et
al. 2009) for de novo repeat prediction based on a custom library
produced by RepeatModeler. Repbase (Jurka et al. 2005) was downloaded
from http://www.girinst.org/repbase/ and was used for homology-based
repeat detection. Repbase and the de novo repeat library were merged
together to annotate the repetitive elements in the assembled Sulv 1
genome by using RepeatMasker. Parameters for running RepeatMasker were:
“-nolow -no_is -norna -engine wublast”. And default parameters were
used for LTR_FINDER (Xu et al. 2007), RepeatScout (Price et al. 2005)
and PASTEClassifier (Hoede et al. 2014) softwares.
Protein-coding genes prediction of the assembled Sulv 1 genome was
conducted using three different strategies: ab initio prediction,
predictions based on homologous species, and based on unigenes. EVM
(v1.1.1, Haas et al. 2008) software was used to integrate these three
prediction results using default parameters. Softwares used for ab
initio prediction were Genscan (Burge et al. 1997), Augustus (v2.4,
Stanke et al. 2003), GlimmerHMM (v3.0.4, Majoros et al. 2004), GeneID
(v1.4, Blanco et al. 2007), SNAP (version 2006-07-28, Korf et al. 2004),
and default parameters were used. GeMoMa (v1.3.1, Keilwagen et al. 2016,
Keilwagen et al. 2018, with default parameters) was used for homology
based prediction with protein sequences from homologous species
including Arabidopsis thaliana , Vigna radiata , Vigna
unguiculata and Glycine max . For unigene based prediction, PASA
(v2.0.2, Campbell et al. 2006) software was used on the basis of
assembled RNA-seq unigenes with the following parameters:
“-align_tools gmap -maxIntronLen 20000”. Specifically, Hisat (v2.0.4,
Kim et al. 2015) and Stringtie (v1.2.3, Pertea et al. 2015) were used
for the assembly of transcripts;
TransDecoder
(v2.0, available online: https://transdecoder.github.io/) and
GeneMarkS-T (v5.1, Tang et al. 2015) were used for gene prediction.
Parameters used for Hisat software were: “–max-intronlen 20000
–min-intronlen 20” and default parameters were used for Stringtie,
TransDecoder and GeneMarkS-T.
For the annotation of noncoding RNAs, Blastn was used for genome-wide
comparison to identify microRNAs and rRNAs based on the Rfam
(Griffiths-Jones et al. 2005) database, and tRNAs were identified using
tRNAscan-SE (Lowe et al. 1997) sofaware.
Using the predicted protein sequences, BLAT (Kent et al. 2002) alignment
was conducted to find homologous gene sequences (possible genes) in the
assembled Sulv 1 genome, and then GeneWise (Birney et al. 2004) was used
to detect immature stop codons and frameshift mutations in the gene
sequences to identify pseudogenes with default parameters. E-value
cutoff for GenBlastA (She et al. 2009) was set to 1e-5.
The sequences of the predicted protein-coding genes were searched
against commonly used Nr (Marchler-Bauer et al. 2011), KOG (Koonin et
al. 2004), GO (Dimmer et al. 2012), KEGG (Kanehisa et al. 2000) and
TrEMBL (Boeckmann et al. 2003) databases for gene function annotation
with BLAST software (v2.2.31, Altschul et al. 1990, e-value cutoff
1e-5). Motif annotation was performed through comparison against PROSITE
(Bairoch et al. 1991), HAMAP (Lima et al. 2009), Pfam (Finn et al.
2006), PRINTS (Attwood et al. 1994), ProDom (Bru et al. 2005), SMART
(Letunic et al. 2004), TIGRFAMs (Haft et al. 2003), PIRSF (Wu et al.
2004), SUPERFAMILY (Gough et al. 2002), CATH-Gene3D (Lees et al. 2012)
and PANTHER (Thomas et al. 2003) databases using InterProScan (Zdobnov
et al. 2001) software.
2.6
| Gene family analysis
The protein sequences of Sulv 1 and 10 related species includingVigna radiata , cowpea, common bean, soybean, Vigna
angularis , Lablab purpureus , Medicago truncatula ,Lotus japonicus , Vigna subterranea and Arabidopsis
thaliana (downloaded from NCBI database) were used for gene family
clustering through OrthoMCL (Li et al. 2003) software with default
parameter settings.
2.7
| Phylogenetic tree reconstruction and divergence time
prediction
Single-copy orthologues identified from the analyzed genomes were used
for subsequent phylogenetic tree reconstruction and divergence time
evaluation. Multiple sequence alignment was performed using MUSCLE
(Edgar et al. 2004), and then a phylogenetic tree was constructed using
PHYML (Guindon et al. 2010) software based on the alignment. MCMCTREE
implemented in the PAML package (v4.7b, Yang et al. 1997) was used to
estimate the speciation time.