Introduction
Throughout history, outbreaks of infectious diseases have repeatedly brought tragedy to humankind. Although we have been able to conquer most of them, several continue to circulate in human populations and emerge from time to time. The recent outbreak of the novel coronavirus disease 2019 (COVID-19)[1–3], caused by a coronavirus, has made us more alert to the emergence of infectious diseases that originate in animal reservoirs and are transmitted between animals and people (so-called zoonotic diseases). Nearly two-thirds of emerging infectious diseases (EIDs) are known to have their origins in animals[4–6]. In the U.S., the zoonotic diseases of most concern include zoonotic influenza, salmonellosis, West Nile virus, plague, emerging coronaviruses, rabies, brucellosis and Lyme disease[4–6]. Other EIDs, such as those caused by human immunodeficiency virus type 1 (HIV-1), Escherichia coli O157:H7, hantavirus, dengue virus and Zika virus, also place a significant burden on public health and global economies at present. Methods to closely surveil and efficiently control EIDs are therefore urgently needed for pandemic prevention. In this review article, we focus on EIDs caused by viruses.
Relative to DNA viruses, RNA viruses have high mutation rates[7], due in part to the error-prone, low-fidelity RNA-dependent RNA polymerases that replicate their genomes[8], which in turn contribute to viral sequence changes. These changes are essential for viruses to maintain their fitness, in particular by allowing them to switch hosts frequently under different selection pressures[9]. At the per-site level, viral sequence change is often dominated by synonymous nucleotide substitutions in coding regions, which leave the protein sequence unaffected[10]. In contrast, nonsynonymous substitutions change the protein sequence and frequently alter the physicochemical properties of the encoded amino acids, thereby having a much greater effect on an individual. Taking severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as an example, the most common type of nonsynonymous mutation observed is alanine to valine (Ala → Val)[11]. Either way, substituted nucleotides that are retained after repeated circulation within a population are imprints of a sort, reflecting how viruses adapt as the host niche changes over evolutionary timescales. It is important to stress, however, that these changes in viral sequence space are not necessarily straightforward: some substituted nucleotides may disappear during this evolutionarily transient process, whereas others may persist until the virus has fully adapted to its new host. Therefore, in the second part of this review we focus on the potential of molecular barcoding technology, a systematic and quantitative approach that makes it possible to follow sequential changes in viral genomic sequences experimentally at the single-sequence level, and that we consider indispensable for dissecting the molecular basis of any present or upcoming EID caused by emerging and newly discovered viruses.
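The distinction between synonymous and nonsynonymous substitutions can be illustrated with a minimal Python sketch. It builds the standard genetic code and classifies a single-codon change by comparing the encoded amino acids. The function name `classify_substitution` and the example codons are hypothetical and chosen for illustration only; they are not taken from any SARS-CoV-2 dataset or from the studies cited above.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1).
# Codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> tuple[str, str, str]:
    """Return (reference amino acid, altered amino acid, substitution class).

    Hypothetical helper: RNA codons (with U) are accepted and converted to
    the DNA alphabet before lookup. '*' denotes a stop codon.
    """
    ref = ref_codon.upper().replace("U", "T")
    alt = alt_codon.upper().replace("U", "T")
    ref_aa = CODON_TABLE[ref]
    alt_aa = CODON_TABLE[alt]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind


# Illustrative codons only: GCU -> GUU changes alanine (A) to valine (V),
# whereas GCU -> GCC leaves alanine unchanged.
print(classify_substitution("GCU", "GUU"))  # ('A', 'V', 'nonsynonymous')
print(classify_substitution("GCU", "GCC"))  # ('A', 'A', 'synonymous')
```

A single C-to-U change at the second codon position is enough to convert any alanine codon (GCN) into a valine codon (GUN), whereas changes at the third position of an alanine codon are always synonymous.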
Molecular barcoding was developed as a tool to investigate population diversity. The strategy was first proposed to solve the problem of PCR duplication and to improve the accuracy of next-generation sequencing quantification[12–15]. Molecular barcodes have been given various names, such as unique identifier, unique molecular identifier (UMI)[16], primer ID[17] and duplex barcode. They commonly take the form of a string of random nucleotides, partially degenerate nucleotides, or defined nucleotides. The underlying concept is that each original DNA or RNA fragment within the same sample pool is tagged with a unique barcode sequence[18]. Sequence reads that carry different molecular barcodes derive from different original molecules, whereas sequence reads that carry the same molecular barcode result from PCR duplication of the same original molecule[18]. Barcode length can vary (normally 4–20 base pairs): the longer the barcode, the lower the probability that sequence reads from two or more different molecules carry identical barcodes. By employing molecular barcodes, we can thus distinguish PCR artifacts from sequence variants present in different original molecules[13,19].
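Two ideas in this paragraph lend themselves to a short worked sketch in Python: how barcode length affects the chance that distinct molecules share a barcode, and how reads are collapsed by barcode. The sketch rests on simplifying assumptions that are ours, not the cited studies': barcodes are drawn uniformly at random from all 4^n sequences of length n, barcodes themselves are read without error, and all reads in a barcode group have equal length. The function names `collision_probability` and `collapse_by_barcode` are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod


def collision_probability(barcode_length: int, n_molecules: int) -> float:
    """Probability that at least two of n_molecules share a barcode.

    Birthday-problem calculation, assuming uniformly random barcodes
    over the 4**barcode_length possible sequences.
    """
    n_barcodes = 4 ** barcode_length
    if n_molecules > n_barcodes:
        return 1.0  # more molecules than barcodes: a collision is certain
    p_all_unique = prod((n_barcodes - i) / n_barcodes for i in range(n_molecules))
    return 1.0 - p_all_unique


def collapse_by_barcode(reads: list[tuple[str, str]]) -> dict[str, str]:
    """Collapse (barcode, sequence) reads into one consensus per barcode.

    Reads sharing a barcode are treated as PCR duplicates of one original
    molecule; the consensus takes the most frequent base at each position,
    so a minority base within a group is treated as a PCR or sequencing error.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    consensus = {}
    for barcode, sequences in groups.items():
        columns = zip(*sequences)  # iterate position by position
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in columns
        )
    return consensus


# Toy reads (made up for illustration): two original molecules, five reads.
reads = [
    ("ACGT", "GCTA"),
    ("ACGT", "GCTA"),
    ("ACGT", "GTTA"),  # minority base within the group: treated as an artifact
    ("TTAG", "GTTA"),
    ("TTAG", "GTTA"),  # same base in every read of another molecule: a variant
]
print(collapse_by_barcode(reads))  # {'ACGT': 'GCTA', 'TTAG': 'GTTA'}

# Longer barcodes make collisions among 100 molecules far less likely.
for length in (4, 8, 12):
    print(length, collision_probability(length, 100))
```

In the toy data, the same nucleotide difference is discarded when it appears in a minority of reads sharing a barcode, but kept when every read of a separate barcode supports it. Real pipelines typically also need to handle errors within the barcode itself and require a minimum number of reads per barcode, which this sketch omits.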
Over the past ten years, technological progress has brought the molecular barcoding strategy to single-molecule resolution[12,13,16,17,20–22] and enabled the detection of low-frequency and subclonal variants[23]. The strategy can now be applied to many aspects of virus research, for example viral transmission[24], transcriptomic analyses of viruses[25], evolutionary dynamics[26], diagnostics of infectious diseases[27], viral capsid functions[28] and the analysis of a viral gene[17]. In this review we summarize several studies in which molecular barcodes were used to understand the molecular bases and viral fitness[29–31] of zoonotic viruses, with emphasis on SARS-CoV-2, HIV-1, influenza virus and Zika virus. We then set out our ideas on how molecular barcodes can be applied to closely survey and predict the evolutionary dynamics of sequence changes in other emerging and newly discovered viruses in vitro.