Introduction
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel coronavirus that infects humans and is responsible for the outbreak of coronavirus disease that began in December 2019 in Wuhan, China, and later spread worldwide [1]. SARS-CoV-2 belongs to the betacoronaviruses, which also include bat coronaviruses (BatCoV) as well as SARS-CoV and MERS-CoV [2]. Although the origin of SARS-CoV-2 is still unclear, a few closely related CoVs with high sequence identity have been identified, including BatCoV RaTG13 (MN996532.1, identity: ~96%) [3]. Similarity analyses between SARS-CoV-2 and animal-infecting CoVs suggest a bat or pangolin origin [3,4]. However, the putative inter-species evolution and infection mechanisms remain largely unknown.
SARS-CoV-2 has a positive-sense, single-stranded RNA (ssRNA) genome of about 30 kb [1]. Apart from the 5’- and 3’-untranslated regions (UTRs), almost the entire genome is occupied by coding regions. The 5’-terminal region encodes the largest polyprotein, ORF1ab, which is involved in genome transcription and replication [5,6]. The spike (S) glycoprotein attaches the virus to the cell membrane by interacting with the host receptor, initiating infection [7-9]. The remaining ORFs encode the envelope (E), membrane (M), and nucleocapsid (N) proteins as well as a few accessory proteins such as ORF3a and ORF8 [10].
The S protein is processed by host cell furin or another cellular protease to yield the mature S1 and S2 proteins [11,12]. The S1 fragment, which contains the receptor-binding domain (RBD), is responsible for receptor binding, while a second cleavage of S2 releases a fusion peptide after viral attachment to the host cell receptor. The SARS-CoV-2 S protein has been reported to bind human angiotensin-converting enzyme 2 (ACE2) with higher affinity than the SARS-CoV spike protein [13]. Because of its importance in determining host specificity and infection efficiency, the coronavirus spike glycoprotein is a key target for vaccines, therapeutic antibodies, and diagnostics [14,15].
With the increasing amount of SARS-CoV-2 sequencing data deposited in public databases, the characteristics of genomic variation among isolates are emerging [16-19]. Here, more than 17,000 complete SARS-CoV-2 genome sequences were analyzed to provide a landscape of the mutations of this novel coronavirus. Furthermore, a genome embedding method was used to infer genome similarity, providing novel insights into the phylogenetic origin of SARS-CoV-2.
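To illustrate the general idea of comparing genomes through embeddings, the sketch below maps each sequence to a normalized k-mer frequency vector and compares the vectors by cosine similarity. This is only one simple form of embedding, chosen here as an assumption for illustration; the text above does not specify the embedding used in this study, and the function names (`kmer_embedding`, `cosine_similarity`), the choice of k = 3, and the short toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_embedding(seq, k=3):
    """Return a normalized k-mer frequency vector for a nucleotide sequence.

    Assumption: a plain k-mer count embedding, used purely for illustration.
    k-mers containing characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sequences (hypothetical, not real viral genomes).
seq_a = "ATGGCTAGCTAGGATCCGATCGATCGTACG"
seq_b = "ATGGCTAGCTAGGATCCGATCGATCGTACC"  # seq_a with one substitution
seq_c = "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"  # unrelated low-complexity sequence

emb_a, emb_b, emb_c = (kmer_embedding(s) for s in (seq_a, seq_b, seq_c))
print("similarity(a, b):", cosine_similarity(emb_a, emb_b))
print("similarity(a, c):", cosine_similarity(emb_a, emb_c))
```

In this toy setting the near-identical pair scores higher than the unrelated pair. For genomes of about 30 kb, a larger k or a learned embedding would typically be more discriminative, but that choice is outside what the text specifies.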