Introduction
Throughout history, outbreaks of infectious diseases have repeatedly
brought tragedy to humankind. Even though we have been able to conquer
most of them, several continue to circulate in human populations and
re-emerge from time to time. The recent outbreak of coronavirus disease
2019 (COVID-19)[1–3], caused by a novel coronavirus, has made us more
alert to the emergence of infectious diseases that originate from animal
reservoirs and are transmitted between animals and people (so-called
zoonotic diseases). It is known that nearly two-thirds of emerging
infectious diseases (EIDs) have their origins in animals[4–6].
In the U.S., the zoonotic diseases of greatest concern include zoonotic
influenza, salmonellosis, West Nile virus, plague, emerging
coronaviruses, rabies, brucellosis and Lyme disease[4–6].
Other EIDs, such as human immunodeficiency virus type 1 (HIV-1)
infections, Escherichia coli O157:H7, hantavirus, dengue fever and Zika
virus, also place a significant burden on public health and global
economies at present. Therefore, ways to closely surveil and efficiently
control EIDs for pandemic prevention are urgently needed. In this review
article, we focus on EIDs caused by viruses.
Relative to DNA viruses, RNA viruses have high rates of mutation[7],
due in part to the error-prone, low-fidelity RNA-dependent RNA
polymerases that replicate their genomes[8], which in turn contribute
to viral sequence changes. These changes are thought to be essential for
viruses to maintain their fitness, especially by allowing viruses to
frequently undergo host switching under different selection
pressures[9].
On a per-site level, it is known that virus sequence change is often
dominated by synonymous nucleotide substitutions in coding regions,
which leave the protein sequence unaffected[10]. In contrast,
nonsynonymous substitutions change the protein sequence and frequently
alter the physicochemical properties of the affected amino acids,
thereby having a much greater effect on an individual. Taking severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as an example, the
most common type of nonsynonymous mutation observed is alanine to valine
(Ala → Val)[11].
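The distinction between synonymous and nonsynonymous substitutions can be
illustrated with a minimal Python sketch. It assumes the standard genetic
code and a single-nucleotide change within one codon; the function name
classify_substitution and its arguments are hypothetical and chosen only
for illustration, not taken from any of the cited studies.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-nucleotide substitution within one codon.

    codon     -- three-letter codon (DNA or RNA alphabet)
    position  -- 0-based position within the codon (0, 1 or 2)
    new_base  -- the substituted nucleotide
    Returns (kind, amino_acid_before, amino_acid_after).
    """
    codon = codon.upper().replace("U", "T")
    new_base = new_base.upper().replace("U", "T")
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "nonsynonymous"
    return kind, before, after


# GCT (Ala) -> GTT (Val): a second-position change alters the amino acid.
print(classify_substitution("GCT", 1, "T"))
# GCT (Ala) -> GCC (Ala): a third-position change leaves it unchanged.
print(classify_substitution("GCT", 2, "C"))
```

In this sketch a C-to-T change at the second codon position of an alanine
codon yields valine, whereas a change at the third position leaves the
amino acid unchanged.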
Either way, substituted nucleotides that are retained after repeated
circulation within a population are a kind of imprint, reflecting how
viruses adapt as the host niche changes over evolutionary timescales. It
is important to stress that these changes in the viral sequence space
are not necessarily straightforward: some substituted nucleotides may
disappear during this evolutionarily transient process, whereas others
may remain until the viruses have fully adapted to new hosts. Therefore,
in the second part of this review we focus on the potential of molecular
barcoding technology, a systematic and quantitative approach with which
sequential changes in viral genomic sequences can be followed
experimentally at the single-sequence level. Such an approach is
indispensable for dissecting the molecular basis of any present or
upcoming EID caused by emerging and newly discovered viruses.
Molecular barcoding was developed as a tool to investigate population
diversity. The strategy was first proposed to solve the problem of PCR
duplication and to improve the accuracy of next-generation sequencing
quantification[12–15]. In the past, molecular barcodes have been given
various names, such as unique identifier, unique molecular identifier
(UMI)[16], primer ID[17] and duplex barcodes. Molecular barcodes
commonly take the form of strings of random nucleotides, partially
degenerate nucleotides, or defined nucleotides. The concept is that each
original DNA or RNA fragment within the same pool of samples is tagged
with a unique molecular barcode sequence[18].
Sequence reads that contain different molecular barcodes originate from
different molecules, whereas sequence reads with the same molecular
barcode are the result of PCR duplication from the same original
molecule[18].
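The grouping logic described above can be sketched in a few lines of
Python. The sketch assumes, purely for illustration, that the barcode
occupies a fixed number of bases at the start of each read and is read
without error; real library designs and analysis pipelines differ, and
the function name group_reads_by_barcode is hypothetical.

```python
from collections import defaultdict


def group_reads_by_barcode(reads, barcode_length):
    """Group reads into families that share the same molecular barcode.

    Assumption: the first `barcode_length` bases of each read are the
    barcode and the remainder is the tagged fragment.
    """
    families = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        fragment = read[barcode_length:]
        families[barcode].append(fragment)
    return dict(families)


# Toy reads: two PCR copies of one molecule and one read of another.
reads = [
    "AAAAGATTACA",
    "AAAAGATTACA",
    "CCCCGATTACA",
]
families = group_reads_by_barcode(reads, barcode_length=4)

# Each barcode family is counted once, regardless of its read depth.
print(len(reads), "reads from", len(families), "original molecules")
```

Counting families rather than reads removes the bias introduced by
uneven PCR amplification, which is why barcodes improve quantification.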
The length of molecular barcodes can vary (normally 4–20 base pairs):
the longer the barcode, the lower the probability that two or more
sequence reads derived from different molecules carry identical
barcodes. By employing molecular barcodes, we can thus distinguish PCR
artifacts from sequence variants present in different original
molecules[13,19].
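Both points can be made concrete with a short Python sketch. It rests on
simplifying assumptions that are ours rather than those of the cited
studies: barcodes are drawn uniformly at random from all 4^L sequences of
length L, reads within a barcode family are already aligned and of equal
length, and a simple majority vote defines the consensus. The function
names are hypothetical.

```python
from collections import Counter


def barcode_collision_probability(n_molecules, barcode_length):
    """Probability that at least two of n molecules share a barcode.

    Assumes barcodes are drawn uniformly at random from the 4**L
    possible sequences of length L (a birthday-problem calculation).
    """
    n_barcodes = 4 ** barcode_length
    if n_molecules > n_barcodes:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= 1 - i / n_barcodes
    return 1 - p_all_distinct


def family_consensus(sequences, min_fraction=0.5):
    """Majority-vote consensus of equal-length reads sharing one barcode.

    A base is called only if it is supported by more than
    `min_fraction` of the reads; otherwise "N" is reported.
    """
    consensus = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) > min_fraction:
            consensus.append(base)
        else:
            consensus.append("N")
    return "".join(consensus)


# Longer barcodes make accidental sharing between molecules less likely.
for length in (8, 12, 16):
    print(length, barcode_collision_probability(1000, length))

# An error seen in one of three PCR copies is outvoted by the others.
print(family_consensus(["ACGT", "ACGT", "ACTT"]))
```

A substitution that appears in only a minority of reads within a family
is treated here as a PCR or sequencing artifact, whereas one present in
every read of the family is retained as a candidate variant of the
original molecule.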
In the past ten years, the molecular barcoding strategy has progressed
to reach single-molecule resolution[12,13,16,17,20–22] and to detect
low-frequency and subclonal variations[23]. This strategy can now be
applied to study many aspects of viruses, for example viral
transmission[24], transcriptomic analyses of viruses[25], evolutionary
dynamics[26], diagnostics of infectious diseases[27], viral capsid
functions[28], as well as the analysis of a viral gene[17].
In this review we summarize several studies in which molecular barcodes
are used to understand the molecular bases and viral fitness[29–31] of
zoonotic viruses, with emphasis placed on SARS-CoV-2, HIV-1, influenza
virus and Zika virus. We then outline our ideas about how molecular
barcodes can be applied to closely survey and predict the evolutionary
dynamics of sequence changes in other emerging and newly discovered
viruses in vitro.