Discussion
As more SARS-CoV-2 genomes are sequenced, more mutations will be
discovered. In this study, more than 17,000 complete SARS-CoV-2 genomes
collected worldwide were analyzed to characterize mutations at both the
nucleotide and protein levels. Apart from deletions and insertions at
the two ends of the genome, a few frequent mutations were discovered.
These mutations may result from positive selection, which should be
carefully studied in the future. The mutations may also serve as markers
to track the origin of different isolates, and the conserved regions
provide useful information for developing robust molecular diagnostic
methods.
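
As an illustration of how frequent substitutions can be tallied, the
following is a minimal sketch that assumes the genomes have already been
aligned to a reference of equal length. The function name
mutation_frequencies and the toy sequences are hypothetical and are not
taken from this study's pipeline or data.

```python
from collections import Counter


def mutation_frequencies(reference, aligned_genomes):
    """Return the fraction of genomes carrying each substitution.

    Keys are (1-based position, reference base, observed base).
    Assumes every genome is already aligned to the reference.
    """
    counts = Counter()
    for genome in aligned_genomes:
        if len(genome) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        for pos, (ref_base, alt_base) in enumerate(zip(reference, genome), start=1):
            # Skip matches, gaps ('-') and ambiguous bases such as 'N'.
            if alt_base != ref_base and ref_base in "ACGT" and alt_base in "ACGT":
                counts[(pos, ref_base, alt_base)] += 1
    total = len(aligned_genomes)
    return {mutation: n / total for mutation, n in counts.items()}


# Toy data (hypothetical): two of four genomes carry C4T; the 'N' is ignored.
reference = "ATGCCGTA"
genomes = ["ATGCCGTA", "ATGTCGTA", "ATGTCGTA", "ATGCCGTN"]
freqs = mutation_frequencies(reference, genomes)
```

Substitutions with a high frequency in such a table are the candidates
for lineage markers, while positions that never appear in it correspond
to conserved sites.
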
To investigate the phylogeny of SARS-CoV-2, a Doc2vec model was used to
embed genome sequences. Doc2vec is an unsupervised learning algorithm
that learns vector representations of documents, from which the
similarity between them can be inferred. The distances estimated from
genome embedding appear to differ from those derived from sequence
alignment. Because of interspecies exchange of genetic fragments, the
overall similarity of whole genomes may not be sufficient to reveal
evolutionary relationships. The results of genome embedding should not
be neglected, but they need to be carefully investigated in the future.
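
To make the idea of an alignment-free distance concrete, the sketch
below treats a genome as a "document" whose "words" are overlapping
k-mers. The tokenization scheme and the value k=3 are assumptions, since
this section does not specify them. Note that the example compares raw
k-mer counts with cosine distance as a simple stand-in; it is not a
Doc2vec model, which would learn dense vectors from such tokens instead.

```python
import math
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers (the 'words' of the genome)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def cosine_distance(counts_a, counts_b):
    """Cosine distance between two k-mer count vectors stored as Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("sequences must be at least k bases long")
    return 1.0 - dot / (norm_a * norm_b)


# Toy sequences (hypothetical, far shorter than a real genome).
genome_a = "ATGCGTACGTTAGC"
genome_b = "ATGCGTACGTTAGC"
genome_c = "TTTTTTTTTTTTTT"

profile_a = Counter(kmer_tokens(genome_a))
profile_b = Counter(kmer_tokens(genome_b))
profile_c = Counter(kmer_tokens(genome_c))

d_ab = cosine_distance(profile_a, profile_b)  # identical sequences
d_ac = cosine_distance(profile_a, profile_c)  # no shared k-mers
```

Because such a distance depends only on k-mer composition and not on
positional homology, it need not agree with an alignment-based distance,
which is consistent with the discrepancy noted above.
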
Besides genomic variation, mutations in the proteins encoded by
SARS-CoV-2 were also analyzed. Some proteins, including the spike
protein, showed less evolutionary constraint, and some frequent
mutations were identified. Whether these mutations result from positive
selection, and what their biological significance is, should be
investigated in the future. Knowledge of the conservation and diversity
of the SARS-CoV-2 proteome will aid in elucidating the infection
mechanism and developing therapeutic methods.
For the S protein, the most intensively studied coronavirus protein,
all mutations were analyzed. The S protein appears to be evolving
rapidly, and the SBD domain is the most susceptible to mutation. How
these mutations affect receptor specificity and affinity needs further
research.
Identification of the S protein’s mutations will provide the basis for
optimizing the design of diagnostic, antiviral and vaccination
strategies for this emerging infection.