Genome embedding
To characterize genomic diversity over time, 1,928 SARS-CoV-2 isolates were sampled from the complete genomes, with no more than 10 genomes per submission date (Supplementary S3). Each genomic sequence was split into a list of overlapping k-mers (k = 6) using a sliding window with a step of one base from the 5’ to the 3’ end. To avoid the effect of gapped sequences, any k-mer containing N (any base) was removed from the list. The genome embedding model learns to predict a central k-mer from the whole-genome embedding and the embeddings of the k-mers in a context window (size = 6) on either side of it. The embedding model was trained on the 1,928 SARS-CoV-2 genomes and 362 closely related virus genomes. After unsupervised training, the model was used to infer embeddings of genome sequences. All genomes were embedded into the same 32-dimensional vector space, which allows them to be compared and the distances between them to be computed. Training and inference of the embedding model were performed in Gensim (http://radimrehurek.com/gensim/tutorial.html) using the Doc2Vec model (vector_size=32, min_count=3, window=6, epochs=30).
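The k-mer tokenization and Doc2Vec training described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (kmer_tokens, train_genome_embedding, embed_genome) and the input layout (a dict mapping a genome identifier to its sequence string) are hypothetical, and only the four Doc2Vec parameters reported in the text are set, with every other option left at the Gensim default (which selects the distributed-memory variant, dm=1).

```python
def kmer_tokens(sequence, k=6):
    """Split a sequence into overlapping k-mers with a step of one base,
    dropping every k-mer that contains the ambiguity code N."""
    seq = sequence.upper()
    tokens = []
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if "N" not in kmer:
            tokens.append(kmer)
    return tokens


def train_genome_embedding(genomes):
    """Train a Doc2Vec model on a dict {genome_id: sequence}.

    Assumption: one tagged document per genome, tagged with its identifier.
    """
    # Imported here so that the tokenizer above works without Gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=kmer_tokens(seq), tags=[genome_id])
        for genome_id, seq in genomes.items()
    ]
    # Parameters as reported in the text; all others are Gensim defaults.
    model = Doc2Vec(vector_size=32, min_count=3, window=6, epochs=30)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def embed_genome(model, sequence):
    """Infer a 32-dimensional embedding for a single genome sequence."""
    return model.infer_vector(kmer_tokens(sequence))
```

With a trained model, embed_genome(model, sequence) returns one vector per genome, and vectors from different genomes can be compared directly because they share the same space. The tokenizer filters only N, as stated in the text; other IUPAC ambiguity codes would pass through unchanged in this sketch.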
To visualize the distribution of the embeddings, a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) projection was generated with a perplexity of 20, a learning rate of 200 and 5,000 iterations. The t-SNE was computed with scikit-learn and plotted with matplotlib in Python.
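A possible way to reproduce this projection is sketched below. It is an illustration under assumptions rather than the original script: the helper names (tsne_kwargs, project_embeddings, plot_projection), the fixed random seed, the marker size and the output file name are all hypothetical. The iteration argument of scikit-learn's TSNE was renamed from n_iter to max_iter in version 1.5, so the sketch picks whichever name the installed version accepts.

```python
import inspect


def tsne_kwargs(accepted_params):
    """Map the settings reported in the text onto TSNE keyword arguments."""
    kwargs = {"n_components": 2, "perplexity": 20, "learning_rate": 200}
    # scikit-learn >= 1.5 uses max_iter; older releases use n_iter.
    iter_name = "max_iter" if "max_iter" in accepted_params else "n_iter"
    kwargs[iter_name] = 5000
    return kwargs


def project_embeddings(vectors, random_state=0):
    """Project genome embeddings (n_genomes x 32) to two dimensions.

    Assumption: random_state is fixed only to make the sketch repeatable.
    """
    import numpy as np
    from sklearn.manifold import TSNE

    accepted = inspect.signature(TSNE.__init__).parameters
    tsne = TSNE(random_state=random_state, **tsne_kwargs(accepted))
    return tsne.fit_transform(np.asarray(vectors, dtype=float))


def plot_projection(coords, path="tsne.png"):
    """Scatter-plot the two t-SNE coordinates and save the figure."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(coords[:, 0], coords[:, 1], s=5)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    fig.savefig(path, dpi=300)
    plt.close(fig)
```

t-SNE requires the perplexity to be smaller than the number of samples, which holds for the genome set described here. Because the method is stochastic, the layout will differ between runs unless the seed is fixed.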