2.7 Gene prediction and functional annotation
To identify tandem repeats and transposable elements (TEs) among the
repetitive sequences, we combined de novo and homology-based methods.
A de novo repeat library was first generated using RepeatModeler
v1.0.11 (Bedell et al. 2000), and the repetitive sequences in the
assembled genome were annotated with RepeatMasker v4.0.6 using default
parameters (Bedell et al. 2000). Afterwards, RepeatMasker,
RepeatProteinMask and Tandem Repeats Finder (TRF, v4.09) were used to
search against the known Repbase repeats (Allred et al. 2008; Bedell et
al. 2000; Benson 1999; Jurka et al. 2005). In addition, simple sequence
repeats (SSRs) were identified with the MIcroSAtellite Identification
Tool (MISA; Thiel et al. 2003).
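
The SSR search described above can be illustrated with a minimal Python sketch that scans a sequence for perfect microsatellites using regular expressions. It is not the MISA implementation: the function name `find_ssrs` and the unit-length thresholds in `MIN_REPEATS` are assumptions chosen for illustration and are not necessarily the settings used in this study.

```python
import re

# Illustrative thresholds: motif length -> minimum number of tandem copies.
# These values are assumptions for the sketch, not the study's settings.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, copies) tuples for perfect SSRs.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{n-1,} demands n-1 or more extra copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # because those runs are reported under the shorter unit length.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "T"
    for start, end, motif, copies in find_ssrs(example):
        print(start, end, motif, copies)
```

Only perfect repeats are reported here; compound or interrupted microsatellites would need additional logic.
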
Non-coding RNAs (ncRNAs), including microRNAs (miRNAs), ribosomal RNAs
(rRNAs), small nuclear RNAs (snRNAs) and transfer RNAs (tRNAs), were
annotated by BLAST searches (E-value ≤ 1e−5) against the Rfam database
(Camacho et al. 2009; Kalvari et al. 2018). RNAmmer v1.2 was used to
predict the rRNAs and their subunits (Lagesen et al. 2007). We also
annotated the tRNAs with tRNAscan-SE v1.3.1 using default parameters
(Lowe & Eddy 1997).
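
As an illustration of the E-value cutoff, the following sketch filters BLAST hits in the standard 12-column tabular format (`-outfmt 6`), in which the E-value is the eleventh column. The function name `filter_blast_hits` and the example rows are hypothetical; the sketch assumes the default column order and is not the pipeline used in this study.

```python
import csv
import io


def filter_blast_hits(handle, max_evalue=1e-5):
    """Yield (query, subject, evalue) for tabular BLAST rows passing the cutoff.

    Assumes the default 12-column tabular layout, where column 11 is the E-value.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present in commented tabular output).
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])
        if evalue <= max_evalue:
            yield row[0], row[1], evalue


if __name__ == "__main__":
    # Hypothetical hits: the first passes the cutoff, the second does not.
    sample = (
        "q1\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t2e-30\t180\n"
        "q2\ts2\t80.0\t30\t6\t0\t1\t30\t5\t34\t0.5\t25\n"
    )
    for hit in filter_blast_hits(io.StringIO(sample)):
        print(hit)
```
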
We combined homology-based searches, de novo prediction and
transcriptome-based approaches to predict the protein-coding gene
structures of S. peregrina. In the homology-based method, protein
sequences from five dipteran insects (Aedes aegypti, Anopheles gambiae,
Drosophila melanogaster, Lucilia cuprina, and Musca domestica) were
used as queries to search against the assembled genome using GeneWise
v2.4.1 (Birney & Durbin 2000). For de novo prediction, the
homology-based predictions were used to train the model parameters of
Augustus v3.0 (Stanke et al. 2004), SNAP (Korf 2004), GlimmerHMM
(Majoros et al. 2004), and GeneID v1.4.4 (Bromberg & Rost 2007).
Meanwhile, transcriptome data were aligned to the genome assembly using
PASA and TopHat (Haas et al. 2008; Moriya et al. 2007). Subsequently,
we integrated all predicted genes into a consensus gene set using
EVidenceModeler v1.1.1 (Haas et al. 2008). Genes containing TEs were
then identified by searching against Repbase with the TransposonPSI
package and removed (Yagi et al. 2013). The final gene set for the
assembled genome was obtained after this filtering step.
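
The TE-filtering step can be sketched as a simple set difference: genes flagged as TE-related are dropped from the consensus set. The function name `remove_te_genes` and the gene identifiers are hypothetical, and the sketch assumes that the TE-related gene IDs have already been extracted from the search output; it does not parse any tool-specific file format.

```python
def remove_te_genes(gene_ids, te_hit_ids):
    """Return the gene IDs that have no TE-related hit, preserving input order."""
    te_related = set(te_hit_ids)
    return [gene for gene in gene_ids if gene not in te_related]


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    consensus = ["gene0001", "gene0002", "gene0003", "gene0004"]
    te_hits = ["gene0002", "gene0004", "gene0004"]
    print(remove_te_genes(consensus, te_hits))
```
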
To annotate gene functions, the predicted genes were aligned against
the NR, Swiss-Prot, TrEMBL, KEGG, KOG, GO, Pfam and InterProScan
databases.
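
Results from several databases are typically combined into a per-database count plus the number of genes annotated by at least one database. The sketch below shows one way to compute such a summary; the function name `annotation_summary`, the gene identifiers and the example hit sets are hypothetical and do not represent results from this study.

```python
def annotation_summary(all_genes, hits_by_db):
    """Count annotated genes per database, in any database, and in none.

    hits_by_db maps a database name to the gene IDs with a hit in it.
    Database names must not be "any" or "unannotated", which are reserved keys.
    """
    all_genes = set(all_genes)
    summary = {}
    annotated = set()
    for db, genes in hits_by_db.items():
        # Ignore IDs that are not part of the predicted gene set.
        genes = set(genes) & all_genes
        summary[db] = len(genes)
        annotated |= genes
    summary["any"] = len(annotated)
    summary["unannotated"] = len(all_genes - annotated)
    return summary


if __name__ == "__main__":
    # Hypothetical gene IDs and hits, for illustration only.
    genes = ["g1", "g2", "g3", "g4"]
    hits = {"NR": ["g1", "g2"], "KEGG": ["g2"], "Pfam": ["g2", "g3"]}
    print(annotation_summary(genes, hits))
```
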