2.5 Genome annotation
Repetitive sequences, whether clustered or dispersed among genes, are
widespread in eukaryotic genomes. According to their distribution, they
are divided into interspersed repeats and tandem repeats. The
repetitive sequences of C. striatipennis were annotated mainly by
de novo prediction and homology-based searches. For the homology-based
search, RepeatMasker v4.1.0 and RepeatProteinMask v4.1.0 were used to
identify repetitive sequences against the
RepBaseRepeatMaskerEdition-2018.10.26 database (Bao et al., 2015). For
de novo prediction, a repetitive sequence database was built using
RepeatModeler v2.0.2a (Price et al., 2005), and the protein-coding
sequences in this database were then filtered using BLAST (Kent, 2002).
RepeatMasker was subsequently applied to predict repeat sequences in the
genome. LTR_finder and LTR_retriever were used to predict long terminal
repeats (LTRs), and tandem repeats were identified with Tandem Repeats
Finder (TRF) (Benson, 1999; Xu & Wang, 2007).
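To illustrate how repeat annotations of this kind can be summarized, the following minimal Python sketch totals the annotated bases per repeat class from RepeatMasker `.out` rows. It is not the authors' script: the function name and the example rows are hypothetical, and it assumes the standard whitespace-delimited `.out` layout in which columns 6 and 7 hold the query start and end and column 11 holds the repeat class/family.

```python
from collections import defaultdict


def summarize_repeatmasker_out(lines):
    """Sum annotated bases per repeat class from RepeatMasker .out rows.

    Overlapping annotations are counted more than once, so the totals are
    an upper bound on the number of masked bases.
    """
    bases_by_class = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Data rows start with an integer Smith-Waterman score; header
        # and blank lines do not, so they are skipped here.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        # Keep only the top-level class, e.g. "LTR/Gypsy" -> "LTR".
        repeat_class = fields[10].split("/")[0]
        bases_by_class[repeat_class] += end - begin + 1
    return dict(bases_by_class)


# Hypothetical rows in .out layout (not data from this study).
example = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    "  1200  10.5  0.0  1.2  chr1  101  400  (9600) + Gypsy-1  LTR/Gypsy      1    300 (0) 1",
    "   300   5.0  0.0  0.0  chr1  501  560  (9440) + (AT)n    Simple_repeat  1    60  (0) 2",
    "   900  12.0  2.0  0.5  chr2  1    250  (4750) C Copia-3  LTR/Copia      (10) 250 1   3",
]

print(summarize_repeatmasker_out(example))
# {'LTR': 550, 'Simple_repeat': 60}
```

Because overlapping hits are summed independently, a non-redundant estimate of repeat content would require merging intervals first.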
High-quality protein-coding genes were annotated by integrating
homology-based, de novo, and transcriptome-based predictions.
For homology-based prediction, protein sequences of five species
(Anopheles gambiae, Culex quinquefasciatus, Drosophila melanogaster,
Musca domestica, and Polypedilum vanderplanki) and the RNA-seq data of
C. striatipennis, which were assembled into a full-length transcriptome
with Trinity v2.8.5, were merged and aligned to the genome sequence for
gene prediction via Maker v2.31.10 (Grabherr et al., 2011; Haas et al.,
2013; Henschel et al., 2012; Holt & Yandell, 2011; Stanke et al., 2006).
Then, the complete sequences of 15,586 genes derived from the
homology-based prediction were used to train hidden Markov models with
Augustus v3.4.0 and SNAP v2017-03-01. In this process, BUSCO was used to
accelerate model training for Augustus (Johnson et al., 2008; Manni et
al., 2021; Stanke et al., 2006). Finally, Maker was applied to annotate
and integrate the results generated by the above methods.
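The text does not define how "complete" gene sequences were selected for training. As one plausible reading, the sketch below keeps only coding sequences that start with ATG, end with a stop codon, have a length divisible by three, and contain no in-frame internal stop. These criteria, the standard genetic code, and the function names are assumptions made for illustration, not the authors' stated procedure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def is_complete_cds(cds):
    """Return True if a coding sequence looks like a complete ORF."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        return False
    # Reject sequences with an in-frame stop before the terminal codon.
    internal = (cds[i:i + 3] for i in range(3, len(cds) - 3, 3))
    return not any(codon in STOP_CODONS for codon in internal)


def select_training_genes(cds_by_gene):
    """Keep only gene models whose CDS passes the completeness check."""
    return {gene: seq for gene, seq in cds_by_gene.items()
            if is_complete_cds(seq)}


# Hypothetical gene models (not data from this study).
candidates = {
    "geneA": "ATGAAATAA",      # complete
    "geneB": "ATGTAAAAATAG",   # in-frame internal stop
    "geneC": "ATGAAAT",        # length not a multiple of three
    "geneD": "CTGAAATAA",      # no ATG start
}

print(sorted(select_training_genes(candidates)))
# ['geneA']
```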
The predicted gene models were then functionally annotated against
protein databases. Diamond was used to align the predicted
protein-coding genes to the SwissProt (http://www.uniprot.org/) and
NCBI Nr databases (Buchfink et al., 2021; Buchfink et al., 2015). KEGG
annotations were obtained from the KAAS web server
(https://www.genome.jp/kaas-bin/kaas_main), accessed in March 2022.
InterProScan was used to compare the predicted protein-coding genes
with the InterPro and GO (The Gene Ontology Consortium) databases
(Jones et al., 2014).
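A common way to turn alignment output into one annotation per gene is to keep the highest-scoring hit for each query. The sketch below does this for 12-column BLAST-style tabular output, which Diamond can produce. The function name, the example rows, and the E-value cutoff of 1e-5 are hypothetical; the text does not report the thresholds actually used.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return the highest-bitscore hit per query from 12-column tabular output.

    Columns are assumed to be: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in tabular_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:  # illustrative cutoff, not from the paper
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical alignment rows (not data from this study).
rows = (
    "gene1\tsp|P1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
    "gene1\tsp|P2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t230.0\n"
    "gene2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25.0\n"
)

print(best_hits(rows))
# {'gene1': ('sp|P2', 1e-60, 230.0)}
```

Here gene2 receives no annotation because its only hit fails the assumed E-value cutoff.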
For non-coding RNA annotations, tRNAscan-SE v2.0.9 was used to annotate
tRNA sequences (Lowe & Eddy, 1997); Infernal v1.1.4
(http://infernal.janelia.org/) was used to predict other non-coding
RNAs against Rfam 14.7 (Nawrocki & Eddy, 2013).
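Non-coding RNA predictions of this kind are often summarized as counts per Rfam family. The sketch below counts significant hits in Infernal tabular (`--tblout`) output. It assumes the default 18-column layout as written by `cmscan`, where the first column is the Rfam model name and column 17 carries the inclusion flag (`!` for hits that pass the inclusion threshold). The text does not state which Infernal program or thresholds were used, so the function name, the example rows, and this layout are illustrative assumptions.

```python
from collections import Counter


def count_significant_hits(tblout_lines):
    """Count hits flagged '!' per model name in Infernal --tblout lines."""
    counts = Counter()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip comments and blank lines
        # The last column is a free-text description, so split at most
        # 17 times to keep it in one piece.
        fields = line.split(None, 17)
        if len(fields) < 17:
            continue
        model_name, inclusion_flag = fields[0], fields[16]
        if inclusion_flag == "!":
            counts[model_name] += 1
    return dict(counts)


# Hypothetical rows in default tblout layout (not data from this study).
tblout = [
    "#target name accession query name accession mdl ...",
    "tRNA    RF00005 chr1 - cm 1 71  100 171 + no 1 0.55 0.0 60.2 1.2e-12 ! tRNA",
    "tRNA    RF00005 chr2 - cm 1 71  900 830 - no 1 0.48 0.0 55.0 3.4e-11 ! tRNA",
    "5S_rRNA RF00001 chr1 - cm 1 119 500 618 + no 1 0.52 0.0 80.1 5.0e-18 ! 5S ribosomal RNA",
    "U6      RF00026 chr3 - cm 1 104 10  110 + no 1 0.40 0.1 20.3 0.8     ? U6 spliceosomal RNA",
]

print(count_significant_hits(tblout))
# {'tRNA': 2, '5S_rRNA': 1}
```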