2.5 Genome annotation
Repetitive sequences, either clustered or dispersed among genes, are widespread in eukaryotic genomes. According to their distribution, they are divided into interspersed repeats and tandem repeats. The repetitive sequences of C. striatipennis were annotated by combining de novo prediction with homology-based searches. RepeatMasker v4.1.0 and RepeatProteinMask v4.1.0 were used to identify repetitive sequences with reference to the RepBaseRepeatMaskerEdition-2018.10.26 database (Bao et al., 2015). For de novo prediction, a repetitive sequence database was built using RepeatModeler v2.0.2a (Price et al., 2005), and the protein-coding sequences in this database were then filtered using BLAST (Kent, 2002). RepeatMasker was subsequently applied to predict repeat sequences in the genome. LTR_finder and LTR_retriever were used to predict long terminal repeats (LTRs), and tandem repeats were identified with Tandem Repeats Finder (TRF) (Benson, 1999; Xu & Wang, 2007).
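As an illustration of how repeat annotations of this kind can be summarized, the following minimal Python sketch totals the annotated bases per top-level repeat class from a RepeatMasker `.out` table. It assumes the standard whitespace-delimited `.out` layout (query begin and end in columns 6 and 7, class/family in column 11). The function name `repeat_bp_by_class` and the sample rows are hypothetical and are not taken from this study. Overlapping annotations are not merged, so the totals may overcount.

```python
from collections import defaultdict


def repeat_bp_by_class(out_text):
    """Sum annotated bases per top-level repeat class in RepeatMasker .out text.

    Assumption: standard .out layout. Overlapping hits are NOT merged.
    """
    totals = defaultdict(int)
    for line in out_text.splitlines():
        fields = line.split()
        # Data rows have at least 14 columns and start with an integer score;
        # this also skips the two header lines and blank lines.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        # Column 11 holds "class/family" (e.g. "LTR/Gypsy"); keep the class.
        top_class = fields[10].split("/")[0]
        totals[top_class] += end - begin + 1
    return dict(totals)


# Hypothetical rows for illustration only (not data from this study).
SAMPLE_OUT = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin  end  (left)  ID

  463   1.3  0.6  1.7  chr1      101   200   (9800)  +  (TA)n     Simple_repeat    1  100    (0)   1
 1200  10.0  2.0  1.0  chr1      501   1000  (9000)  C  Gypsy-1   LTR/Gypsy     (10) 500      1   2
  800  12.0  0.0  0.0  chr2      1     250   (4750)  +  Copia-3   LTR/Copia        1  250    (0)   3
"""

if __name__ == "__main__":
    for repeat_class, bp in sorted(repeat_bp_by_class(SAMPLE_OUT).items()):
        print(repeat_class, bp)
```

Grouping by the class part of the class/family field separates tandem-type entries such as simple repeats from interspersed elements such as LTRs, which mirrors the distinction drawn in the text.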
High-quality protein-coding genes were annotated by integrating homology-based, de novo and transcriptome-based predictions. For homology-based prediction, protein sequences of five species (Anopheles gambiae, Culex quinquefasciatus, Drosophila melanogaster, Musca domestica, Polypedilum vanderplanki) and the RNA-seq data of C. striatipennis, which were assembled into a full-length transcriptome with Trinity v2.8.5, were combined as evidence for alignment to the genome and gene prediction with Maker v2.31.10 (Grabherr et al., 2011; Haas et al., 2013; Henschel et al., 2012; Holt & Yandell, 2011; Stanke et al., 2006). The complete sequences of 15,586 genes derived from the homology-based prediction were then used to construct hidden Markov models with Augustus v3.4.0 and SNAP v2017-03-01. In this process, BUSCO was used to accelerate model construction for Augustus (Johnson et al., 2008; Manni et al., 2021; Stanke et al., 2006). Finally, Maker was applied to annotate the genome and integrate the results generated by the above methods.
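The selection of complete gene sequences for model training can be sketched as follows. The text does not state the criteria used to call a gene "complete", so this minimal Python example assumes a common definition: the coding sequence starts with ATG, ends with a stop codon, has a length divisible by three, and contains no internal stop codon under the standard genetic code. The function names `is_complete_cds` and `select_training_genes` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def is_complete_cds(cds):
    """Return True if a coding sequence looks complete under the assumed rules."""
    seq = cds.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with a premature (internal) stop codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def select_training_genes(cds_by_gene):
    """Keep only the gene models whose CDS passes is_complete_cds."""
    return {gene: cds for gene, cds in cds_by_gene.items() if is_complete_cds(cds)}


if __name__ == "__main__":
    # Hypothetical toy sequences for illustration only.
    candidates = {"gene1": "ATGAAATAA", "gene2": "ATGAAA", "gene3": "ATGTAAAAATAA"}
    print(sorted(select_training_genes(candidates)))
```

Restricting the training set to complete models is one way to keep truncated or frame-shifted predictions from distorting the parameters learned by ab initio predictors.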
The predicted gene models were then functionally annotated using protein databases. Diamond was used to annotate the predicted protein-coding genes by alignment against the SwissProt (http://www.uniprot.org/) and NCBI Nr databases (Buchfink et al., 2021; Buchfink et al., 2015). KEGG annotations were obtained from the KAAS web server (https://www.genome.jp/kaas-bin/kaas_main), accessed in March 2022. The predicted protein-coding genes were compared against the InterPro and GO (The Gene Ontology Consortium) databases using InterProScan (Jones et al., 2014).
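One common way to turn alignment results into per-gene annotations is to keep the best-scoring database hit for each query. The minimal Python sketch below does this for tab-separated output in the 12-column BLAST tabular layout (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The source does not describe how hits were selected, so the best-hit rule, the `max_evalue` cutoff of 1e-5, and the function name `best_hits` are assumptions for illustration.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} keeping the top bit score.

    Assumption: 12-column BLAST-style tabular input; the cutoff is illustrative.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is too weak under the assumed cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical alignment rows for illustration only.
    rows = [
        "g1\tsp|P1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0",
        "g1\tsp|P2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250.0",
        "g2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20.0",
    ]
    print(best_hits("\n".join(rows)))
```

Genes with no hit below the cutoff are simply absent from the result, which makes it straightforward to count how many predicted genes received an annotation from each database.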
For non-coding RNA annotation, tRNAscan-SE 2.0.9 was used to annotate tRNA sequences (Lowe & Eddy, 1997), and Infernal v1.1.4 (http://infernal.janelia.org/) was used to predict other non-coding RNAs against Rfam 14.7 (Nawrocki & Eddy, 2013).