Genome annotation
Whole-genome sequencing showed that the genome of strain GDHYZ30 is
4,785,117 bp in size, with a GC content of 62.67% and 4,398
protein-coding genes, the longest of which is 11,694 bp. The average
length of the protein-coding genes is 957 bp, and these genes account
for 88.04% of the genome (Fig. 3A and Table 1). After comparing
the sequences by BLASTx against the NR (NCBI non-redundant protein
sequences), Swiss-Prot, COG (Clusters of Orthologous Groups of
proteins), and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases,
we identified 4,220 unigenes with significant hits in NR, and 3,179,
3,346, and 2,145 unigenes were annotated according to the Swiss-Prot,
COG, and KEGG databases, respectively (Fig. 3B). By aligning with the NR
database, it is possible to assess how closely the sequences of this
strain resemble those of related species and to obtain functional
information on the homologous sequences. Strain GDHYZ30 had the most
sequences (3,059) aligned with Chromobacterium haemolyticum (Fig. 3C).
Functional prediction and
classification of unigenes were performed by comparing sequence data
against the COG database. A total of 1,935 unigenes were annotated and
grouped into 14 categories according to COG functional classifications.
Among them, the three largest categories were “amino acid transport and
metabolism” (330 genes), “transcription” (262 genes), and “energy
production and conversion” (191 genes) (Fig. 3D).
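
The summary statistics reported above can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the annotation pipeline. It assumes that the 88.04% figure refers to the share of genome length covered by protein-coding genes, and it uses the rounded mean gene length of 957 bp, so the result should only approximately match the reported value. The function names are hypothetical.

```python
# Figures taken from the text; the mean gene length is a rounded value.
GENOME_SIZE_BP = 4_785_117
N_PROTEIN_CODING_GENES = 4_398
MEAN_GENE_LENGTH_BP = 957


def coding_fraction(n_genes: int, mean_len_bp: float, genome_bp: int) -> float:
    """Approximate fraction of the genome covered by protein-coding genes.

    Assumes genes do not overlap, so total coding length is simply
    n_genes * mean_len_bp.
    """
    return n_genes * mean_len_bp / genome_bp


def gene_density_per_kb(n_genes: int, genome_bp: int) -> float:
    """Number of protein-coding genes per kilobase of genome."""
    return n_genes / (genome_bp / 1000)


if __name__ == "__main__":
    frac = coding_fraction(
        N_PROTEIN_CODING_GENES, MEAN_GENE_LENGTH_BP, GENOME_SIZE_BP
    )
    density = gene_density_per_kb(N_PROTEIN_CODING_GENES, GENOME_SIZE_BP)
    print(f"Estimated coding fraction: {frac:.2%}")
    print(f"Gene density: {density:.3f} genes per kb")
```

Under these assumptions the estimate comes out at roughly 88%, in line with the reported proportion. The small difference is expected from rounding of the mean gene length.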