2.5 Genome annotation
We combined de novo, homology-based, and transcriptome-based methods to predict protein-coding genes. Five tools were used for de novo prediction: Genscan (version 3.1) (Burge & Karlin, 1997), Augustus (version 3.1) (Stanke & Waack, 2003), GlimmerHMM (version 3.0.4) (Majoros, Pertea, & Salzberg, 2004), GeneID (version 1.4) (Blanco, Parra, & Guigó, 2007), and SNAP (version 2006-07-28) (Korf, 2004). For homology-based prediction, protein sequences from four representative species (Danio rerio, Crassostrea gigas, Crassostrea virginica, and Mizuhopecten yessoensis) were aligned to the Asian clam protein sequences using GeMoMa (version 1.3.1) (Keilwagen et al., 2016). Transcriptome data were mapped to the genomic sequences with HISAT (version 2.0.4) (Kim, Langmead, & Salzberg, 2015) and assembled with StringTie (version 1.2.3) (Pertea et al., 2015). TransDecoder (version 2.0) (http://transdecoder.github.io) and GeneMarkS-T (version 5.1) (Tang, Lomsadze, & Borodovsky, 2015) were then used for transcriptome-based gene prediction. Finally, the predictions from all three approaches were integrated into a non-redundant set of protein-coding genes using EVM (version 1.1.1) (Haas et al., 2008) and PASA (version 2.0.2) (Haas et al., 2003).
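As a rough illustration of what a non-redundant gene set means at the integration step, the sketch below collapses gene models that share identical coordinates across predictors and records which sources support each locus. It is not EVM's algorithm, which builds consensus structures from weighted evidence; the `GeneModel` type, the scaffold name, and all coordinates are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical minimal gene model: one predicted locus from one source."""
    seqid: str
    start: int
    end: int
    strand: str
    source: str


def collapse_identical(models):
    """Group models with identical (seqid, start, end, strand) and list their sources."""
    support = defaultdict(set)
    for m in models:
        support[(m.seqid, m.start, m.end, m.strand)].add(m.source)
    # Sort by locus so the output order is deterministic.
    return [(locus, sorted(sources)) for locus, sources in sorted(support.items())]


# Made-up predictions: two tools agree on one locus, a third reports another.
predictions = [
    GeneModel("scaffold1", 100, 900, "+", "Augustus"),
    GeneModel("scaffold1", 100, 900, "+", "GeMoMa"),
    GeneModel("scaffold1", 1500, 2400, "-", "SNAP"),
]
nonredundant = collapse_identical(predictions)
```

Real integration must also reconcile models that overlap only partially, which is where evidence weighting becomes necessary.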
Other genomic features, including pseudogenes and non-coding RNAs, were also annotated. Non-coding RNAs were identified with reference to the miRBase database (version 21.0) (Griffiths-Jones, Grocock, Van Dongen, Bateman, & Enright, 2006) and Rfam (version 13.0) (Daub, Eberhardt, Tate, & Burge, 2015). Putative pseudogene candidates were assessed for premature stop codons or frameshift mutations in the gene structure using GenBlastA (version 1.0.4) (She et al., 2011). Transfer RNAs (tRNAs) were identified with tRNAscan-SE (version 1.3.1) (Lowe & Eddy, 1997). MicroRNAs and ribosomal RNAs (rRNAs) were identified with Infernal (version 1.1) (Nawrocki & Eddy, 2013).
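The two pseudogene criteria named above can be illustrated with a minimal sketch, assuming a candidate coding sequence has already been extracted and is read in frame from its first base. This is not how GenBlastA works internally; the function names and example sequences are hypothetical, and the length check is only a crude proxy for a frameshift.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stop codons before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [
        i for i in range(n_codons - 1)
        if cds[3 * i:3 * i + 3] in STOP_CODONS
    ]


def has_frameshift(cds):
    """Crude proxy: a length that is not a multiple of three suggests a disrupted frame."""
    return len(cds) % 3 != 0


intact = "ATGGCTGAATGGTAA"     # ATG GCT GAA TGG TAA
disrupted = "ATGGCTTGATGGTAA"  # ATG GCT TGA TGG TAA (stop at codon index 2)
```

A candidate would be flagged when `premature_stops` returns a non-empty list or `has_frameshift` is true. Detecting real frameshifts requires comparing the candidate against its parent protein alignment rather than checking length alone.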
The protein-coding genes were functionally annotated by aligning them against the EuKaryotic Orthologous Groups (KOG) (Tatusov et al., 2003), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa & Goto, 2000), TrEMBL (Boeckmann et al., 2003), Swiss-Prot (Boeckmann et al., 2003), and Non-redundant (Nr) (Marchler et al., 2011) databases using BLAST (version 2.2.31) (Altschul, Gish, Miller, Myers, & Lipman, 1990) with a maximal E-value of 1e−05. KEGG pathway annotations and Gene Ontology (GO) (Consortium, 2004) terms were assigned to identify gene functions using Blast2GO (version 4.1) (Conesa et al., 2005).
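The E-value threshold can be illustrated with a short sketch that keeps, for each query gene, the best hit at or below 1e−05. It assumes BLAST results in the default 12-column tabular layout (`-outfmt 6`, with the E-value in the eleventh column); the text does not state which output format was used, and the gene and subject identifiers below are made up.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # maximal E-value stated in the text


def best_hits(handle, cutoff=E_VALUE_CUTOFF):
    """Keep, per query, the lowest-E-value hit that does not exceed the cutoff."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        # Assumed column order: qseqid, sseqid, ..., evalue (index 10), bitscore
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical tabular output: gene2's only hit fails the cutoff.
example = io.StringIO(
    "gene1\tsubjA\t78.5\t200\t43\t0\t1\t200\t5\t204\t1e-50\t180\n"
    "gene1\tsubjB\t60.0\t180\t72\t0\t1\t180\t1\t180\t1e-20\t95\n"
    "gene2\tsubjC\t35.0\t60\t39\t0\t1\t60\t1\t60\t0.002\t30\n"
)
hits = best_hits(example)
```

Genes without any hit under the cutoff, such as `gene2` here, are absent from the result and would remain unannotated for that database.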