Using Dirichlet mixture model to detect concomitant changes in allele frequencies

Wen Jenny Shi,
Corbin Jones,
Jan Hannig

Abstract

RNA viruses are challenging for protein and nucleotide sequence-based methods of molecular evolutionary analysis because of their high mutation rates and complex secondary structures. With new DNA and RNA sequencing technologies, viral sequence data from both individuals and populations are becoming easier and cheaper to obtain. Thus, there is a critical need for methods that can identify alleles whose frequencies change over time or in response to a treatment. We have developed a novel statistical approach for identifying evolved nucleotides and/or amino acids in a viral genome without relying on sequence annotation or the nature of the change. Instead, it identifies nucleotides that have similar patterns of change. Our approach models allelic variances under a Bayesian Dirichlet mixture distribution. Using a multi-stage clustering procedure, we have developed an efficient scheme that distinguishes treatment-caused changes from variation within viral populations. We applied our method to a longitudinal, time-sampled influenza A H1N1 virus strain in either the absence or presence of oseltamivir in replicated experiments. We find three genomic locations with strong evidence of a treatment effect and a list of sites with high genetic variation in the untreated environment. We believe our approach can be broadly applied and is particularly useful for cases that are recalcitrant to traditional sequence analysis.