Discussion
As more SARS-CoV-2 genomes are sequenced, more mutations will be discovered. In this study, more than 17,000 complete SARS-CoV-2 genomes collected worldwide were analyzed to characterize mutations at both the nucleotide and protein levels. Apart from deletions and insertions at the two ends of the genome, a few frequent mutations were discovered. These mutations may result from positive selection, which should be studied carefully in the future. The mutations may also serve as markers to track the origin of different isolates, and the conserved regions provide useful information for developing robust molecular diagnostic methods.
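To make the notion of a "frequent mutation" concrete, the following is a minimal sketch of how per-position substitution and deletion frequencies could be tallied against a reference. It assumes the genomes are already aligned to the reference at equal length; the function names, the toy sequences and the 0.5 threshold are illustrative assumptions, not the pipeline or cut-off used in this study.

```python
from collections import Counter


def mutation_frequencies(reference, aligned_genomes):
    """Map (position, ref_base, alt_base) to the fraction of genomes carrying it.

    Positions are 1-based. Assumes every genome is already aligned to the
    reference, so all sequences have the same length.
    """
    counts = Counter()
    for genome in aligned_genomes:
        if len(genome) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        for pos, (ref_base, base) in enumerate(zip(reference, genome), start=1):
            # Ignore matches and ambiguous calls (N); '-' is kept as a deletion.
            if base != ref_base and base != "N":
                counts[(pos, ref_base, base)] += 1
    total = len(aligned_genomes)
    return {mutation: n / total for mutation, n in counts.items()}


def frequent_mutations(freqs, min_fraction):
    """Return the mutations seen in at least min_fraction of the genomes."""
    return sorted(m for m, f in freqs.items() if f >= min_fraction)


# Toy sequences for illustration only, not real SARS-CoV-2 data.
reference = "ATGCATGC"
genomes = ["ATGCATGC", "ATGTATGC", "ATGTATGC", "ATG-ATGA"]

freqs = mutation_frequencies(reference, genomes)
common = frequent_mutations(freqs, min_fraction=0.5)  # hypothetical threshold
print(common)
```

A frequency table of this kind only flags candidates; whether a frequent mutation reflects positive selection rather than a founder effect or sampling bias requires separate analysis.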
To investigate the phylogeny of SARS-CoV-2, a Doc2vec model was used to embed the genome sequences. Doc2vec is an unsupervised learning algorithm that learns a vector representation for each document, from which the similarity between documents can be inferred. The distances estimated from the genome embeddings appear to differ from those obtained by sequence alignment. Because of interspecies exchange of genetic fragments, the overall similarity of whole genomes may not be sufficient to reveal evolutionary relationships. The genome-embedding results should not be neglected, but they need to be investigated carefully in the future.
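As a rough illustration of alignment-free comparison, the sketch below treats each genome as a "document" whose "words" are overlapping k-mers, then compares genomes by the cosine distance between their k-mer count vectors. This is a deliberately simplified stand-in: it uses raw counts rather than learned Doc2vec vectors, and the k-mer tokenization, the value k = 3, the function names and the toy sequences are all assumptions made for illustration rather than details taken from this study.

```python
import math
from collections import Counter


def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers, the 'words' of the document."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def cosine_distance(counts_a, counts_b):
    """Cosine distance (1 - cosine similarity) between two k-mer count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one k-mer")
    return 1.0 - dot / (norm_a * norm_b)


# Toy sequences for illustration only, not real genomes.
genome_a = "ATGCATGCAT"
genome_b = "ATGCATGCAT"
genome_c = "TTTTTTTTTT"

vec_a = Counter(kmer_words(genome_a))
vec_b = Counter(kmer_words(genome_b))
vec_c = Counter(kmer_words(genome_c))

dist_ab = cosine_distance(vec_a, vec_b)  # identical k-mer content
dist_ac = cosine_distance(vec_a, vec_c)  # no shared k-mers
print(dist_ab, dist_ac)
```

Because such a distance depends only on k-mer composition and not on positional homology, it can rank genomes differently from an alignment-based distance, which is one plausible reason the two approaches disagree.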
In addition to genomic variation, mutations in the encoded proteins of SARS-CoV-2 were also analyzed. Some proteins, including the spike protein, showed weaker evolutionary constraint, and some frequent mutations were identified. Whether these mutations result from positive selection, and what their biological significance is, should be investigated in the future. Knowledge of the conservation and diversity of the SARS-CoV-2 proteome will aid in elucidating the infection mechanism and developing therapeutic methods.
For the S protein, the most intensively studied coronavirus protein, all mutations were analyzed. The S protein appears to be evolving rapidly, and the SBD domain is the most susceptible to mutation. How these mutations affect receptor specificity and affinity needs further research. Identifying the mutations of the S protein will provide a basis for optimizing the design of diagnostic, antiviral and vaccination strategies for this emerging infection.