Genome embedding
To characterize genomic diversity over time, 1,928 SARS-CoV-2
isolates were sampled from the complete genomes, with no more than 10
genomes per submission date (Supplementary S3). Each genomic
sequence was split into a list of overlapping k-mers (k = 6), with a
step of one base from the 5’ to the 3’ end. To avoid effects from gapped
sequences, any k-mer containing N (any base) was removed from the
list. The genome embedding model learns to predict the central k-mer
based on the whole genome embedding and the embeddings for a context
window (size = 6) of k-mers on either side of the central k-mer. The
embedding model was trained on 1,928 SARS-CoV-2 and 362 closely related
virus genomes. After unsupervised training, the model was used to infer
embeddings of genome sequences. All genomes were embedded into the
same 32-dimensional vector space, so that genomes can be compared and
the distances between them computed. Training and inference of the
embedding model were performed in Gensim
(http://radimrehurek.com/gensim/tutorial.html)
using the Doc2Vec model (vector_size=32, min_count=3, window=6,
epochs=30).
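
The tokenization and embedding steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names (kmers, train_genome_embedding, embed_genome) and the dictionary input format are hypothetical, and dm=1 (Gensim's distributed-memory mode) is assumed because it matches the description of predicting a central k-mer from the genome vector plus its context. Only the four Doc2Vec settings listed above come from the text.

```python
def kmers(sequence, k=6):
    """Split a sequence into overlapping k-mers (step of one base, 5' to 3').

    Any k-mer containing N is dropped, as described in the text.
    """
    seq = sequence.upper()
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if "N" not in seq[i:i + k]
    ]


def train_genome_embedding(genomes, k=6):
    """Train a Doc2Vec model on a dict mapping genome ID -> nucleotide sequence.

    The dict input and the function name are illustrative assumptions.
    """
    # Imported here so that the k-mer helper can be used without Gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # One tagged "document" per genome; its "words" are the k-mers.
    corpus = [
        TaggedDocument(words=kmers(seq, k), tags=[genome_id])
        for genome_id, seq in genomes.items()
    ]

    # vector_size, min_count, window and epochs are the values given in the text;
    # dm=1 is an assumption (see the paragraph above).
    model = Doc2Vec(vector_size=32, min_count=3, window=6, epochs=30, dm=1)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def embed_genome(model, sequence, k=6):
    """Infer a 32-dimensional embedding for a genome sequence."""
    return model.infer_vector(kmers(sequence, k))
```

With a trained model, embed_genome(model, seq) returns one vector per genome. Because all vectors lie in the same space, they can be compared with any standard vector distance; the text does not state which distance was used.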
To visualize the distribution of the embeddings, a two-dimensional
t-distributed stochastic neighbor embedding (t-SNE) was generated
using a perplexity of 20, a learning rate of 200, and 5,000
iterations. The t-SNE was computed with scikit-learn and plotted with
matplotlib, both Python libraries.
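
The projection step might look like the sketch below, assuming the genome embeddings are available as a NumPy array of shape (n_genomes, 32). The function names, the random seed, the marker size, and the output file name are hypothetical choices made for illustration; only the perplexity, learning rate, and iteration count come from the text. The iteration argument is looked up at run time because scikit-learn names it max_iter in recent releases and n_iter in older ones.

```python
import inspect

# Settings reported in the text.
TSNE_SETTINGS = {"perplexity": 20, "learning_rate": 200.0, "iterations": 5000}


def run_tsne(vectors, iterations=TSNE_SETTINGS["iterations"], seed=0):
    """Project an (n_genomes, 32) array to two dimensions with t-SNE.

    The seed is an assumption; the text does not report one.
    """
    # Imported here so the module loads even without scikit-learn installed.
    from sklearn.manifold import TSNE

    # Use whichever iteration keyword this scikit-learn version accepts.
    accepted = inspect.signature(TSNE.__init__).parameters
    iter_kwarg = "max_iter" if "max_iter" in accepted else "n_iter"

    model = TSNE(
        n_components=2,
        perplexity=TSNE_SETTINGS["perplexity"],
        learning_rate=TSNE_SETTINGS["learning_rate"],
        random_state=seed,
        **{iter_kwarg: iterations},
    )
    return model.fit_transform(vectors)


def plot_embedding(coords, out_path="tsne.png"):
    """Scatter-plot the 2-D coordinates and save the figure (path is illustrative)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(coords[:, 0], coords[:, 1], s=5)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    fig.savefig(out_path, dpi=300)
    plt.close(fig)
```

Note that scikit-learn requires the perplexity to be smaller than the number of samples, so a perplexity of 20 needs more than 20 genomes in the input array.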