2.5 | Protein-coding gene prediction and functional annotation
A combined strategy of de novo gene prediction, homology-based search and RNA sequencing-aided annotation was used to perform gene prediction. For homology-based annotation, we selected the protein-coding sequences of five homologous species (Brugia malayi, C. elegans, Pristionchus pacificus, Steinernema carpocapsae and T. canis) from NCBI (https://www.ncbi.nlm.nih.gov/). For RNA-based prediction, male and female transcriptome sequences were aligned to the genome and assembled using a TopHat (v2.1.0) (Trapnell, Pachter, & Salzberg) plus Trinity (v2.0.6) (Haas et al.) strategy. PASApipeline (v.2.1.0) was applied to predict gene structures, and the inferred gene structures were then used in AUGUSTUS (v.3.2.3) (Mario et al., 2006) to train gene models based on transcript evidence. In addition, the genome sequence was analyzed with GeneMark (v1.0) (John & Mark, 2005), which uses unsupervised training to build a hidden Markov model. Consistent gene sets were generated by combining all of the above evidence using MAKER (v.2.31.8) (Campbell, Law, Holt, Stein, & Yandell, 2013). Finally, all gene evidence was merged into a comprehensive, non-redundant gene set using EvidenceModeler (v1.1.1, EVM) (Haas et al., 2008).
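The idea of merging several evidence sources into a non-redundant gene set can be illustrated with a minimal sketch. It is not EvidenceModeler's algorithm: it is a simplified greedy filter that keeps the highest-weighted model at each locus and discards lower-weighted models that overlap it on the same strand. The evidence weights, the `GeneModel` type, the function names and the coordinates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    source: str   # evidence type that produced the model
    seqid: str    # scaffold or contig name
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"


# Hypothetical evidence weights (assumption, not taken from the study).
WEIGHTS = {"transcript": 10, "homology": 5, "ab_initio": 1}


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if two models share at least one base on the same strand."""
    return (
        a.seqid == b.seqid
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def non_redundant(models: list[GeneModel]) -> list[GeneModel]:
    """Greedily keep the best-weighted model per overlapping locus."""
    ranked = sorted(
        models, key=lambda m: (-WEIGHTS.get(m.source, 0), m.seqid, m.start)
    )
    kept: list[GeneModel] = []
    for model in ranked:
        if not any(overlaps(model, k) for k in kept):
            kept.append(model)
    # Report in genomic order.
    return sorted(kept, key=lambda m: (m.seqid, m.start))


# Toy input: invented coordinates on invented scaffolds.
models = [
    GeneModel("ab_initio", "scaffold1", 100, 900, "+"),
    GeneModel("transcript", "scaffold1", 120, 950, "+"),
    GeneModel("homology", "scaffold1", 2000, 2600, "-"),
    GeneModel("ab_initio", "scaffold1", 2100, 2500, "-"),
    GeneModel("ab_initio", "scaffold2", 50, 400, "+"),
]
merged = non_redundant(models)
for m in merged:
    print(m.seqid, m.start, m.end, m.strand, m.source)
```

In this toy input, the two ab initio models on scaffold1 are dropped because better-supported models cover the same loci, while the ab initio model on scaffold2 is retained because nothing else overlaps it.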
To perform functional annotation of the genes, we aligned the above gene sets against several known databases, including SwissProt, TrEMBL, KEGG, COG and NR. GO information was obtained with Blast2GO (v.2.5.0) (Conesa et al., 2005). Furthermore, the mitochondrial genome was assembled by BLAST search against the B. schroederi mtDNA sequence from the NCBI database (NC_015927.1) (Xie et al., 2011). The mitochondrial genome was annotated with GeSeq online (https://chlorobox.mpimp-golm.mpg.de/geseq.html) using homologous gene alignment (Michael et al., 2017). Four types of non-coding RNA (ncRNA), namely tRNA, snRNA, miRNA and rRNA, were predicted. tRNAscan-SE (v1.3.1) (Lowe & Eddy, 1997) was used to predict tRNAs. To predict snRNAs, miRNAs and rRNAs, we aligned the B. schroederi genome against the Rfam (v12.0) (Kalvari et al., 2018) database and an invertebrate rRNA database.
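For the database-alignment step of functional annotation, a common downstream task is to keep one best hit per gene. The sketch below assumes the alignments are available in the standard 12-column BLAST tabular layout (`-outfmt 6`); the text does not state which aligner, output format or thresholds were used, so the e-value cutoff of 1e-5, the function name `best_hits`, and the gene and subject identifiers are illustrative assumptions only.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: subject_id} for the highest-bitscore hit per query.

    Hits with an e-value above max_evalue are ignored. The cutoff is an
    assumption for illustration, not a value reported in the text.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or bitscore > current[1]:
            best[row["qseqid"]] = (row["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Invented example rows; identifiers are placeholders.
rows = [
    ("gene0001", "subjA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-50", "250"),
    ("gene0001", "subjB", "60.0", "180", "72", "0", "1", "180", "5", "184", "1e-20", "120"),
    ("gene0002", "subjC", "30.0", "50", "35", "0", "1", "50", "1", "50", "0.5", "25"),
]
table = "\n".join("\t".join(r) for r in rows) + "\n"

hits = best_hits(io.StringIO(table))
print(hits)
```

Here `gene0001` is assigned to `subjA`, its higher-scoring hit, and `gene0002` receives no annotation because its only hit fails the assumed e-value cutoff.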