Genomic similarity of SARS-CoV-2 to closely related species
To estimate the similarity between SARS-CoV-2 and its related species, we
embedded the genomes using the Doc2vec model implemented in Gensim. All
genomes were embedded into a 32-dimensional space, and the distribution of
the embeddings was visualized with t-SNE (Figure 1). Overall, the genome
embeddings clustered by viral species, except for a few outliers among the
SARS coronaviruses.
All the SARS-CoV-2 genome embeddings clustered together, with no pattern
related to submission date. Of note, although BatCoV RaTG13 (MN996532.1)
had high identity (96%) to the SARS-CoV-2 reference genome (NC_045512.2),
comparable to that of some SARS-CoV-2 strains, it was not found in or near
the SARS-CoV-2 cluster. The genome embeddings indicated that SARS-CoV-2
was equally distant from its related species, which differed from
alignment-based inference and needs to be carefully investigated in the
future.
Figure 1