Genomic similarity of SARS-CoV-2 to closely related species
To estimate the similarity between SARS-CoV-2 and its related species, we embedded the genomes using the Doc2vec model implemented in Gensim. All genomes were embedded into a 32-dimensional space, and the distribution of the embeddings was visualized with t-SNE (Figure 1). Overall, the genome embeddings clustered by viral species, except for a few outliers among the SARS coronaviruses. All SARS-CoV-2 genome embeddings fell into a single cluster, with no pattern related to submission date. Of note, although BatCoV RaTG13 (MN996532.1) had high identity (96%) to the SARS-CoV-2 reference genome (NC_045512.2), as did some SARS-CoV-2 strains, it was not found in or near the SARS-CoV-2 cluster. The genome embeddings therefore indicated that SARS-CoV-2 was equally distant from each of the related species. This differs from alignment-based inference and needs to be carefully investigated in the future.
Figure 1
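The embedding and projection steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the text specifies only Doc2vec in Gensim, a 32-dimensional embedding, and t-SNE. The overlapping k-mer tokenization, the value of k, the window size, the number of epochs, the perplexity, and the helper names `kmer_tokens`, `embed_genomes`, and `project_2d` are all assumptions made for the example.

```python
from typing import Dict, List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers ("words").

    Assumption: the source does not say how genomes were tokenized;
    overlapping k-mers are one common choice.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_genomes(genomes: Dict[str, str], k: int = 3,
                  vector_size: int = 32, epochs: int = 20):
    """Train Doc2Vec on the genomes and return one vector per accession.

    Only vector_size=32 comes from the text; every other hyperparameter
    here is an illustrative assumption.
    """
    # Imported lazily so the tokenizer can be used without Gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=kmer_tokens(seq, k), tags=[accession])
        for accession, seq in genomes.items()
    ]
    model = Doc2Vec(vector_size=vector_size, window=5, min_count=1,
                    epochs=epochs, seed=0, workers=1)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count,
                epochs=model.epochs)
    # Gensim 4.x exposes document vectors as `model.dv`
    # (older releases used `model.docvecs`).
    return {accession: model.dv[accession] for accession in genomes}


def project_2d(vectors, perplexity: float = 30.0):
    """Project the embeddings to two dimensions with t-SNE for plotting."""
    import numpy as np
    from sklearn.manifold import TSNE

    names = list(vectors)
    matrix = np.vstack([vectors[name] for name in names])
    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(perplexity, len(names) - 1)
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(matrix)
    return dict(zip(names, coords))
```

Given a dictionary mapping accession numbers to genome sequences, `project_2d(embed_genomes(genomes))` would return two-dimensional coordinates that can be plotted and colored by viral species. Because t-SNE preserves local neighborhoods rather than global distances, distances between clusters in such a plot should be read with caution.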