Using Dirichlet mixture model to detect concomitant changes in allele frequencies


RNA viruses are challenging for protein and nucleotide sequence based methods of molecular evolutionary analysis because of their high mutation rates and complex secondary structures. With new DNA and RNA sequencing technologies, viral sequence data from both individuals and populations are becoming easier and cheaper to obtain. Thus, there is a critical need for methods that can identify alleles whose frequencies change over time or due to a treatment. We have developed a novel statistical approach for identifying evolved nucleotides and/or amino acids in a viral genome without relying on sequence annotation or the nature of the change. Instead it identfies nucleotides that have similar patterns of change. Our approach models allelic variances under a Bayesian Dirichlet mixture distribution. With a multi-stage clustering procedure we have developed an efficient clustering scheme that distinguishes treatment causal changes from variation within viral populations. Our method has been applied to a longitudinal time-sampled influenza A H1N1 virus strain in either the absence of presence of oseltamivir in replicated experiments. We find three genomic locations with strong evidence of treatment effect and a list of sites with high genetic variation in the untreated environment. We believe our approach can be broadly applied and is particularly useful for the cases that are recalcitrant to traditional sequence analysis.


RNA viruses and retroviruses, such as SARS, influenza, hepatitis C, polio, and HIV, use RNA as their genetic material. Thje RNA polymerases of these viruses lack the proof-reading ability of DNA polymerases, which results in a high mutation rate in these RNA genomes. This rapid rate of genome evolution can be advatageous for the virus as it can confound the immune system and medical treatment.

The rapid rate of sequence evolution also confounds efforts to tease apart the evolutionary signal of adaptation (such as evolution of drug resistance) from neutral evolutionary processes such as genetic drift. Phylogenetic and molecular evolutionary analyses of viral genes and genomes have become standard tools for investigating RNA virus evolution at a molecular level (Norstr***INVALID BYTE SEQUENCE HERE***m et al., 2012). However, the high mutation rate and the complex secondary structures of RNA viruses genomes often compromise sequence based methods of analysis (Simmonds et al., 1999; Damgaard et al., 2004; Watts et al., 2009; Cuevas et al., 2012). Adaptation by the virus to a host or to a drug treatment further complicates sequence analysis because compensatory mutations that offset structural defects and other pleiotropic costs of adaptive alleles often arise and sweep to fixation in viral populations (Knies et al., 2008). Thus there is a critical need for analytical methods that identify regions of the viral genome that have changed over time and are robust to these complications by making minimal assumptions about how the virus should evolve.

This need for new annotation free analytical tools has been amplified by the wealth of new viral sequence data made possible by recent advances in sequencing technology (Jabara et al., 2011). Increasingly, populations of thousands of viruses are sampled and sequenced from an infected individual, this approach captures a snapshot of the viral genetic variation within an individual. A couple studies have combined this approach with traditional passage experiments or sampling during the course of an infection (Eriksson et al., 2008; Kuroda et al., 2010; Leitner et al., 1993; Wright et al., 2010). This powerful approach reveals how genomically a population of viruses responds to evolutionary pressure. With the ever-decreasing cost of sequencing, these studies are expected to become commonplace.

Our motivating dataset is the influenza A H1N1 strain time series sample in the presence and absence of an inhibitor of neuraminidase, oseltamivir, collected from multiple passages with total two biological duplicates. We are interested in finding the genomic regions of the virus that are affected by the inhibitor. The influenza A virus (IVA) was first adapted from chicken egg to Madin-Darby canine kidney (MDCK) cells for three passages. Then the samples were serially passaged in MDCK cells in either the absence or presence of oseltamivir in replicated experiments (Figure \ref{fig:flow_flu1}). At the end of each passage, whole-genome high throughput sequencing data were collected. The size of the oval roughly corresponds to the average total read count per site.