2.7 Gene prediction and functional annotation
To identify tandem repeats and transposable elements (TEs) in the repetitive sequences, we combined de novo and homology-based methods. A de novo repeat library was first generated using RepeatModeler v1.0.11 (Bedell et al. 2000). The repetitive sequences in the assembled genome were then annotated using RepeatMasker v4.0.6 with default parameters (Bedell et al. 2000). Afterwards, RepeatMasker, RepeatProteinMask and Tandem Repeats Finder (TRF, v4.09) were used to search against the known repeats in Repbase (Allred et al. 2008; Bedell et al. 2000; Benson 1999; Jurka et al. 2005). In addition, simple sequence repeats (SSRs) were identified using the MIcroSAtellite Identification Tool (MISA; Thiel et al. 2003).
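As an illustration of what SSR identification involves, the following is a minimal Python sketch of a regex-based microsatellite scan. It is not the MISA implementation, and the function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for the example; the thresholds used in this study are not stated above.

```python
import re

# Assumed thresholds (motif length -> minimum number of repeats).
# These are illustrative values, not parameters reported in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) tuples, 1-based inclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT".
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            count = len(m.group(1)) // unit_len
            hits.append((m.start() + 1, m.end(), motif, count))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "T"
    for hit in find_ssrs(example):
        print(hit)
```

The sketch only reports perfect repeats; compound or interrupted microsatellites, which dedicated tools can report, are outside its scope.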
Non-coding RNAs (ncRNAs), including microRNAs (miRNAs), ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs) and transfer RNAs (tRNAs), were annotated by BLAST searches (E-value ≤ 1e−5) against the Rfam database (Camacho et al. 2009; Kalvari et al. 2018). RNAmmer v1.2 was used to predict the rRNAs and their subunits (Lagesen et al. 2007). The tRNAs were additionally annotated using tRNAscan-SE v1.3.1 with default parameters (Lowe & Eddy 1997).
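The E-value cutoff can be applied either at search time or afterwards. The sketch below shows the second option in Python, assuming the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), in which the E-value is column 11 and the bit score column 12. The function name `filter_blast_hits` and the example rows are hypothetical.

```python
import csv
import io


def filter_blast_hits(handle, max_evalue=1e-5):
    """Yield rows of a BLAST tabular file whose E-value is <= max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])  # column 11 of the default tabular layout
        if evalue <= max_evalue:
            yield row


if __name__ == "__main__":
    # Made-up rows for illustration only.
    demo = io.StringIO(
        "scaf1\tRF00001\t98.0\t119\t2\t0\t10\t128\t1\t119\t1e-50\t200\n"
        "scaf1\tRF00005\t80.0\t70\t14\t0\t300\t369\t1\t70\t0.002\t40.1\n"
        "scaf2\tRF00003\t95.0\t160\t8\t0\t5\t164\t1\t160\t1e-5\t150\n"
    )
    for hit in filter_blast_hits(demo):
        print(hit[0], hit[1], hit[10])
```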
We combined homology-based searches, de novo prediction and transcriptome-based approaches to predict the protein-coding gene structures of S. peregrina. For the homology-based method, protein sequences from five dipteran insects (Aedes aegypti, Anopheles gambiae, Drosophila melanogaster, Lucilia cuprina and Musca domestica) were used as queries to search against the assembled genome using GeneWise v2.4.1 (Birney & Durbin 2000). For the de novo predictions, the homology-based predictions were used to train the model parameters of Augustus v3.0 (Stanke et al. 2004), SNAP (Korf 2004), GlimmerHMM (Majoros et al. 2004) and GeneID v1.4.4 (Bromberg & Rost 2007). Meanwhile, the transcriptome data were aligned to the genome assembly using PASA and TopHat (Haas et al. 2008; Moriya et al. 2007). Subsequently, we integrated all predicted genes into a consensus gene set using EVidenceModeler v1.1.1 (Haas et al. 2008). Genes containing TEs were then removed, using the TransposonPSI package to search against Repbase (Yagi et al. 2013). Finally, the complete gene set for the assembled genome was obtained. To annotate gene functions, the predicted genes were aligned against the NR, Swiss-Prot, TrEMBL, KEGG, KOG, GO, Pfam and InterProScan databases.
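The TE-filtering step amounts to dropping every gene model flagged by the TE search from the consensus set. A minimal Python sketch of that operation on a protein FASTA file is given below; it assumes the identifiers of TE-containing genes have already been collected into a set (here the hypothetical `te_ids`), and it does not reproduce TransposonPSI itself. The function name `drop_te_records` is likewise an assumption.

```python
def drop_te_records(fasta_lines, te_ids):
    """Yield FASTA lines, omitting records whose identifier is in te_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            header = line[1:].split()
            keep = bool(header) and header[0] not in te_ids
        if keep:
            yield line


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [">g1 example", "MKV", ">g2", "MTT", "AAA", ">g3", "MQQ"]
    for line in drop_te_records(records, {"g2"}):
        print(line)
```

Filtering on a set of identifiers keeps the step independent of how the TE hits were produced, so the same function would apply to hits from any similarity search.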