Genome annotation
Whole-genome sequencing showed that the genome of strain GDHYZ30 is 4,785,117 bp in size, with a GC content of 62.67% and 4,398 protein-coding genes. The longest protein-coding gene is 11,694 bp, the average length of protein-coding genes is 957 bp, and protein-coding genes account for 88.04% of the genome (Fig. 3A and Table 1). After comparing the sequences using BLASTx against the NR (NCBI non-redundant protein sequences), Swiss-Prot, COG (Clusters of Orthologous Groups of proteins), and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases, we identified 4,220 unigenes with a significant match in NR, while 3,179, 3,346, and 2,145 unigenes were annotated against the Swiss-Prot, COG, and KEGG databases, respectively (Fig. 3B). Alignment against the NR database shows how similar the sequences of this strain are to those of related species and provides functional information on the homologous sequences. Strain GDHYZ30 had the most sequences aligned with Chromobacterium haemolyticum, with 3,059 sequences (Fig. 3C). Functional prediction and classification of the unigenes were performed by comparing the sequence data against the COG database. A total of 1,935 unigenes were annotated and grouped into 14 categories according to COG functional classification. Among them, the three largest categories were “amino acid transport and metabolism” (330 genes), “transcription” (262 genes), and “energy production and conversion” (191 genes) (Fig. 3D).
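The reported 88.04% can be cross-checked against the gene count, average gene length, and genome size given above. The following is a minimal sketch, assuming that the percentage refers to the share of the genome occupied by protein-coding genes and that overlaps between genes are negligible; the function name `coding_density` is illustrative and is not taken from the annotation pipeline.

```python
# Figures as reported in the text for strain GDHYZ30.
GENOME_SIZE_BP = 4_785_117
N_CODING_GENES = 4_398
MEAN_GENE_LENGTH_BP = 957  # rounded value as reported


def coding_density(n_genes: int, mean_len_bp: float, genome_size_bp: int) -> float:
    """Approximate fraction of the genome covered by protein-coding genes.

    Assumption: genes do not overlap, so total coding length is
    simply n_genes * mean_len_bp.
    """
    return n_genes * mean_len_bp / genome_size_bp


density = coding_density(N_CODING_GENES, MEAN_GENE_LENGTH_BP, GENOME_SIZE_BP)
print(f"Estimated coding density: {density:.2%}")
```

With the rounded average length of 957 bp, the estimate is about 87.96%, close to the reported 88.04%; the small gap is plausibly due to rounding of the average gene length.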