% Wen Jenny Shi edited sectionDiscussionlab.tex over 9 years ago
% Commit id: d039a1f915ee532bd6469da5e08ca37d2e3fa08f

\section{Discussion}\label{Sec:discussion}
In this manuscript we introduce a Dirichlet mixture model for detecting and clustering changes in allele frequencies in DNA or RNA samples drawn from a population sampled at different time points. This annotation-free approach is particularly useful for RNA viruses and other organisms where the secondary structure of the RNA can constrain evolution in ways not predicted by the genetic code. Without requiring a prior distribution on the number of mixture components, our method combines a hierarchical divisive clustering tree with block Metropolis-Hastings. It automatically produces both an appropriate upper bound on the number of mixture components and good initial states for the Gibbs sampler performed on the joint sequence. The hierarchical tree structure enables parallel computing and overcomes the computational difficulties that any direct Markov chain Monte Carlo method presents.
Our method outperforms direct Gibbs approaches, with the important additional benefits of avoiding an ad hoc choice of the number of mixture components and of computational efficiency gained from parallel computing. The threshold for identifying substitution sites is derived from a comparison of the posterior distributions for the time collections without treatment. It is chosen by examining the curvature in the graph of the number of members in the noise set instead of selecting an ad hoc cutoff.

We apply our algorithm to simulated datasets.
% Results!(1)
As a positive control, we also apply our method to a well-described HIV dataset. With minimal assumptions on gene annotation, we successfully identify known drug resistance alleles reported in previous literature \citep{Jabara2011} and a list of sites with high variation within the untreated population.

The main application that motivated this work is an H1N1 dataset. To reduce computational intensity, our method is applied to each of the eight segments of IVA. Analyzing multiple time points and treatment and control samples simultaneously, we identify three sites with strong evidence of treatment effect and six locations with high variability not due to the inhibitor. We compare our method to a previous analysis of the same dataset based on a population-genetic approach. Noticing that most of the sites identified using the latter method appear only in the biological replicate with the larger sample size, we suspect that the population-genetic approach is biased by the unbalanced experimental design.
Our algorithm first analyzes each biological replicate individually and then aggregates the results across replicates. Therefore, our inference technique is not sensitive to the unbalanced nature of the data.
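
To make the curvature-based threshold selection mentioned above concrete, one possible formalization is sketched here. The notation is an assumption introduced only for illustration and is not taken from the manuscript: $n(\tau)$ denotes the number of members in the noise set at candidate threshold $\tau$, evaluated on an equally spaced grid $\tau_1 < \dots < \tau_K$ with spacing $h$. Under these assumptions, a discrete curvature can be approximated by the second difference, and the threshold taken where it is largest:
\[
  \kappa_k = \frac{\left| n(\tau_{k+1}) - 2\,n(\tau_k) + n(\tau_{k-1}) \right|}{h^2},
  \qquad
  \hat{\tau} = \tau_{k^*},
  \quad
  k^* = \arg\max_{2 \le k \le K-1} \kappa_k .
\]
This is only a sketch of the idea that the cutoff is read off the shape of the noise-set curve rather than fixed in advance; the criterion actually used in the analysis may differ in its details.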