2 | MATERIALS AND METHODS
2.1 | Plant materials
The pure line of an elite cultivar, Sulv 1 was chosen for genome sequencing. The seedings were grown in a greenhouse under 25 °C and 16 h photoperiod condition. The fresh leaves were collected from one-month old plants, the samples were rapidly frozen in liquid nitrogen and stored at -70°C until use.
2.2 | Genome size estimation
The Sulv 1 genome size was calculated with the following formula: genome size = total k-mer number/average k-mer depth, and total k-mer number is the total number of k-mers from all reads. 350 bp insert size clean reads were used to perform the k-mer (k = 19) analysis. A total of 54,027,628,001 k-mers were counted from these clean reads. A k-mer depth distribution was obtained from paired end reads, and the peak depth was clearly observed from the distribution data. Based on this distribution, the size of the Sulv 1 genome was estimated to be ~ 539.8 Mb and the heterozygosity was estimated to be 0.03% (Figure S1).
2.3 | Genome sequencing and assembly
Total DNA was isolated by using the CTAB method to construct Nanopore and Illumina libraries. Libraries were generated and sequenced on the PromethION sequencer platform (Oxford Nanopore Technologies, UK) at the Biomarker Technologies Corporation (Beijing, China).
We first corrected the errors in the Nanopore sequencing reads with the help of Canu (Koren et al. 2017) software. Based on this corrected data, Sulv 1 genome was assembled using wtdbg2 (https://github.com/ruanjue/wtdbg2) software platform.Wtdbg2 chops reads into 1024 bp segments, merges similar segments into a vertex and connects vertices based on the segment adjacency on reads. The resulting graph is called fuzzy Bruijn graph (FBG) which is similar to De Bruijn graph but permits mismatches/gaps and keeps read paths when collapsing k-mers. The draft genome was first calibrated using Racon (Vaser et al. 2017) with Nanopore reads through 3 rounds of calibration, and Pilon (v1.21, Bruce et al. 2014) was then used to calibrate the draft genome again with the help of short Illumina HiSeq reads in a 3 rounds of calibration process too.
2.4 | Hi-C sequencing and chromosomes anchoring
We constructed Hi-C fragmented libraries (300-700 bp insert size) as illustrated in Rao et al (Rao et al. 2014) and libraries were sequenced through Illumina sequencing platform. Briefly, adapter sequences of raw Hi-C reads were trimmed and low quality paired end reads were removed in order to obtain clean data. The clean Hi-C reads were first truncated at the putative Hi-C junctions and then the resulting trimmed reads were aligned to the draft assembly with the help of bwa aligner (Li et al. 2013). Only uniquely aligned read pairs whose mapping quality was greater than 20 were retained for further analysis. Invalid read pairs, including Dangling-End and Self-cycle, Re-ligation and Dumped products, were filtered by HiC-Pro (v2.8.1, Servant et al. 2015).
The uniquely mapped read pairs were valid interaction pairs and were used for the correction of scaffolds and to order and orientate scaffolds onto chromosomes by LACHESIS (Burton et al. 2013).
Before chromosomes assembly, we first performed a pre-assembly for the error correction of scaffolds which required the splitting of scaffolds into segments of 50 kb on average. The Hi-C data were mapped to these segments using BWA (version 0.7.10-r789, Li et al. 2009) software. The uniquely mapped data were retained to perform assembly by using LACHESIS software. Any two segments which showed inconsistent connection with information from the raw scaffolds were checked manually. These corrected scaffolds were then assembled with LACHESIS. Parameters for running LACHESIS software included: CLUSTER_MIN_RE_SITES=186, CLUSTER_MAX_LINK_DENSITY=2, CLUSTER_NONINFORMATIVE_RATIO=1.3, ORDER_MIN_N_RES_IN_TRUN=98, ORDER_MIN_N_RES_IN_SHREDS=100. After this step, placement and orientation errors exhibiting obvious discrete chromatin interaction patterns were manually adjusted. Finally, 470.45 Mb of the sequences (representing 99.3% total length) were anchored to 11 chromosomes.
2.5 | Gene and repetitive sequence annotation
Repeats were masked on the assembled Sulv 1 genome using de novo-based and homology-based strategies. We used RepeatMasker (Tarailo-Graovac et al. 2009) for de novo repeat prediction based on a custom library produced by RepeatModeler. Repbase (Jurka et al. 2005) was downloaded from http://www.girinst.org/repbase/ and was used for homology-based repeat detection. Repbase and the de novo repeat library were merged together to annotate the repetitive elements in the assembled Sulv 1 genome by using RepeatMasker. Parameters for running RepeatMasker were: “-nolow -no_is -norna -engine wublast”. And default parameters were used for LTR_FINDER (Xu et al. 2007), RepeatScout (Price et al. 2005) and PASTEClassifier (Hoede et al. 2014) softwares.
Protein-coding genes prediction of the assembled Sulv 1 genome was conducted using three different strategies: ab initio prediction, predictions based on homologous species, and based on unigenes. EVM (v1.1.1, Haas et al. 2008) software was used to integrate these three prediction results using default parameters. Softwares used for ab initio prediction were Genscan (Burge et al. 1997), Augustus (v2.4, Stanke et al. 2003), GlimmerHMM (v3.0.4, Majoros et al. 2004), GeneID (v1.4, Blanco et al. 2007), SNAP (version 2006-07-28, Korf et al. 2004), and default parameters were used. GeMoMa (v1.3.1, Keilwagen et al. 2016, Keilwagen et al. 2018, with default parameters) was used for homology based prediction with protein sequences from homologous species including Arabidopsis thaliana , Vigna radiata , Vigna unguiculata and Glycine max . For unigene based prediction, PASA (v2.0.2, Campbell et al. 2006) software was used on the basis of assembled RNA-seq unigenes with the following parameters: “-align_tools gmap -maxIntronLen 20000”. Specifically, Hisat (v2.0.4, Kim et al. 2015) and Stringtie (v1.2.3, Pertea et al. 2015) were used for the assembly of transcripts; TransDecoder (v2.0, available online: https://transdecoder.github.io/) and GeneMarkS-T (v5.1, Tang et al. 2015) were used for gene prediction. Parameters used for Hisat software were: “–max-intronlen 20000 –min-intronlen 20” and default parameters were used for Stringtie, TransDecoder and GeneMarkS-T.
For the annotation of noncoding RNAs, Blastn was used for genome-wide comparison to identify microRNAs and rRNAs based on the Rfam (Griffiths-Jones et al. 2005) database, and tRNAs were identified using tRNAscan-SE (Lowe et al. 1997) sofaware.
Using the predicted protein sequences, BLAT (Kent et al. 2002) alignment was conducted to find homologous gene sequences (possible genes) in the assembled Sulv 1 genome, and then GeneWise (Birney et al. 2004) was used to detect immature stop codons and frameshift mutations in the gene sequences to identify pseudogenes with default parameters. E-value cutoff for GenBlastA (She et al. 2009) was set to 1e-5.
The sequences of the predicted protein-coding genes were searched against commonly used Nr (Marchler-Bauer et al. 2011), KOG (Koonin et al. 2004), GO (Dimmer et al. 2012), KEGG (Kanehisa et al. 2000) and TrEMBL (Boeckmann et al. 2003) databases for gene function annotation with BLAST software (v2.2.31, Altschul et al. 1990, e-value cutoff 1e-5). Motif annotation was performed through comparison against PROSITE (Bairoch et al. 1991), HAMAP (Lima et al. 2009), Pfam (Finn et al. 2006), PRINTS (Attwood et al. 1994), ProDom (Bru et al. 2005), SMART (Letunic et al. 2004), TIGRFAMs (Haft et al. 2003), PIRSF (Wu et al. 2004), SUPERFAMILY (Gough et al. 2002), CATH-Gene3D (Lees et al. 2012) and PANTHER (Thomas et al. 2003) databases using InterProScan (Zdobnov et al. 2001) software.
2.6 | Gene family analysis
The protein sequences of Sulv 1 and 10 related species includingVigna radiata , cowpea, common bean, soybean, Vigna angularis , Lablab purpureus , Medicago truncatula ,Lotus japonicus , Vigna subterranea and Arabidopsis thaliana (downloaded from NCBI database) were used for gene family clustering through OrthoMCL (Li et al. 2003) software with default parameter settings.
2.7 | Phylogenetic tree reconstruction and divergence time prediction
Single-copy orthologues identified from the analyzed genomes were used for subsequent phylogenetic tree reconstruction and divergence time evaluation. Multiple sequence alignment was performed using MUSCLE (Edgar et al. 2004), and then a phylogenetic tree was constructed using PHYML (Guindon et al. 2010) software based on the alignment. MCMCTREE implemented in the PAML package (v4.7b, Yang et al. 1997) was used to estimate the speciation time.