Introduction
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel
human-infecting coronavirus responsible for the outbreak of
coronavirus disease that began in December 2019 in Wuhan, China, and
later spread all over the world [1]. SARS-CoV-2 belongs
to the beta-coronaviruses, which also include bat coronavirus (BCoV) as
well as SARS-CoV and MERS-CoV
[2]. Although the origin of SARS-CoV-2
is still unclear, a few closely related CoVs with high sequence identity
have been identified, including BatCoV RaTG13 (MN996532.1; identity:
~96%) [3]. Similarity analysis between SARS-CoV-2 and animal-infecting
CoVs suggests a bat or pangolin origin
[3,4].
However, the putative mechanisms of inter-species evolution and
infection remain largely unknown.
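
As an illustration of what a sequence identity figure such as the ~96% above measures, the following is a minimal sketch that computes percent identity over two already-aligned sequences. The function name `percent_identity`, the choice to skip gap columns, and the toy fragments are assumptions made for this example; they are not taken from the cited studies, and real analyses typically rely on a dedicated alignment tool.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption for this sketch: gap columns ('-') are excluded from the
    denominator. Other conventions (e.g. counting gaps as mismatches) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments, made up for illustration (not real viral sequence).
query = "ATGTTTGTTTTTCTTGTT-TTA"
subject = "ATGTTTATTTTTCTTGTTATTA"
print(f"{percent_identity(query, subject):.1f}% identity")
```

Excluding gap columns keeps the measure focused on substitutions; whether that is appropriate depends on how insertions and deletions should be weighted in a given comparison.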
SARS-CoV-2 contains a positive-sense, single-stranded RNA (ssRNA) genome
of about 30 kb in size [1]. Apart from
the 5’ and 3’ untranslated regions (UTRs), almost the entire genome is
occupied by coding regions. The 5’-terminal region encodes the largest
polyprotein, ORF1ab, which is involved in genome transcription and
replication
[5,6].
The spike (S) glycoprotein attaches the virus to the cell membrane by
interacting with the host receptor, initiating infection
[7-9]. The remaining ORFs encode the
envelope (E), membrane (M), and nucleocapsid (N) proteins as well as a
few accessory proteins such as ORF3a and ORF8
[10].
The S protein is processed by host cell furin or another cellular
protease to yield the mature S1 and S2 proteins
[11,12].
The S1 fragment, which contains the receptor-binding domain (RBD), is
responsible for receptor binding, while a second cleavage of S2 leads to
the release of a fusion peptide after viral attachment to the host cell
receptor. It was reported that the SARS-CoV-2 S protein binds human
angiotensin-converting enzyme 2 (ACE2) with higher affinity than the
SARS-CoV spike protein [13]. Owing to
its great importance in determining host specificity and infection
efficiency, the coronavirus spike glycoprotein is the key target for
vaccines, therapeutic antibodies, and diagnostics
[14,15].
With the increasing amount of SARS-CoV-2 sequencing data deposited in
public databases, the characteristics of genomic variation among
different isolates are emerging [16-19]. Here,
more than 17,000 complete SARS-CoV-2 genome sequences were analyzed to
provide a landscape of mutations in this novel coronavirus. Furthermore,
a genome embedding method was used to infer genome similarity,
providing novel insights into the phylogenetic origin of SARS-CoV-2.
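
To make the idea of comparing genomes through embeddings more concrete, here is a minimal sketch that represents each sequence as a normalized k-mer frequency vector and compares two sequences by cosine similarity. This is a simple stand-in chosen for illustration: the embedding method actually used in this study is not specified in this section, and the function names `kmer_embedding` and `cosine_similarity`, the value `k=3`, and the toy sequences are all assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_embedding(seq: str, k: int = 3) -> list[float]:
    """Return a 4**k-dimensional vector of normalized k-mer frequencies.

    Assumption for this sketch: k-mers containing characters other than
    A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sequences, made up for illustration (not real viral genomes).
genome_a = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC"
genome_b = "ATGTTTATTTTTCTTGTTTTATTGCCACTAGTT"
similarity = cosine_similarity(kmer_embedding(genome_a), kmer_embedding(genome_b))
print(f"cosine similarity: {similarity:.3f}")
```

An alignment-free representation like this scales to many thousands of genomes because each sequence is reduced to a fixed-length vector once, after which pairwise comparison is cheap; the trade-off is that positional information about individual mutations is lost.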