Wen Jenny Shi edited sectionDiscussionlab.tex  over 9 years ago

Commit id: 018bd420d4df5e66946d76cd6e6030fb402b4b9d

We introduce a Dirichlet mixture model for detecting and clustering changes in allele frequencies in DNA or RNA sequence data from a population sampled at different time points. This annotation-free approach is particularly useful for RNA viruses and other organisms in which the secondary structure of the RNA can influence evolution in ways not predicted by standard analysis methods. To identify significant changes in allele frequency, our algorithm combines a hierarchical divisive clustering tree with a block Metropolis-Hastings sampler. This approach does not require a prior distribution on the number of mixture components. It automatically produces both an appropriate upper bound on the number of clusters (mixture components) and good initial states for the Gibbs sampler applied to the joint sequence. The hierarchical tree structure enables parallel computing and overcomes the computational difficulties that a direct Markov chain Monte Carlo method presents. Our method outperforms direct Gibbs approaches, with the important additional benefits of avoiding an ad hoc choice of the number of mixture components and of the computational efficiency gained from parallel computing. The threshold for identifying substitution sites is derived by comparing the posterior distributions of the time collections without treatment. Rather than selecting an ad hoc cutoff, we choose it by examining the curvature in the graph of the number of members in the noise set.
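
One possible way to formalize the curvature criterion is sketched below. It is an illustration only, under assumed notation that does not appear elsewhere in this section: $N(\tau)$ denotes the number of members in the noise set at a candidate threshold $\tau$, evaluated on an equally spaced grid $\tau_1 < \dots < \tau_m$ with spacing $h$, and $N_i = N(\tau_i)$. The derivatives are approximated by central differences, and the threshold is taken where the resulting discrete curvature is largest; the exact rule used in practice may differ.
\begin{equation}
N'_i \approx \frac{N_{i+1} - N_{i-1}}{2h}, \qquad
N''_i \approx \frac{N_{i+1} - 2N_i + N_{i-1}}{h^2}, \qquad
\kappa_i = \frac{\lvert N''_i \rvert}{\bigl(1 + (N'_i)^2\bigr)^{3/2}}, \qquad
\tau^{*} = \tau_{i^{*}}, \quad i^{*} = \operatorname*{arg\,max}_{2 \le i \le m-1} \kappa_i .
\end{equation}
Because curvature is not invariant to the scaling of the axes, a sketch of this kind would typically rescale $\tau$ and $N$ to comparable ranges before computing $\kappa_i$.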