\resizebox{16cm}{!}{%
\begin{tabular*}{0.5\textwidth}{l*{9}{c}}
\hline
$Y_1$ & $Y_2$ & $Y_3$ & $Y_4$ & $Y_5$ & $Y_6$ & $Y_7$ & $Y_8$ & $\cdots$\\
\hline
 & & & & & C & A & T & $\cdots$\\
C & T & C & T & A & C & A & & $\cdots$\\
\hline
\end{tabular*}}

\subsection{Dirichlet Mixture Model}\label{Sec:model}
To describe the genome site-specific variation within a viral population, we construct a parametric Bayesian mixture model based on observed nucleotide read counts. Given the probability parameters, the collection of read counts at each genome site is assumed to follow a multinomial distribution. For an arbitrary $i^{th}$ position on the sequence, the probabilities of observing each of A, C, G, T, M are denoted $P_{c_i}=(p_{c_i}^1,p_{c_i}^2,p_{c_i}^3,p_{c_i}^4,p_{c_i}^5)$, where $\sum_{j=1}^5 p_{c_i}^j=1$ and each $p_{c_i}^j$ lies between 0 and 1. We assume a finite collection of $K$ possible probability parameters, $\bp=\{P_1,\cdots,P_K\}$, that each genome site can take on, i.e., every $P_{c_i}$ is a member of $\bp$. In other words, the subscript $c_i\in\{1,\cdots,K\}$ is an assignment indicator denoting which probability parameter in the set $\bp$ the $i^{th}$ sequence site is associated with. The number of elements in $\bp$, $K$, is the number of mixture components in the Bayesian mixture framework. Since many sites on the genome sequence share the same tendencies toward certain read types, it is natural to assume that $K$ is much smaller than the length of the viral sequence of interest, $N$. Furthermore, a weakly informative symmetric Dirichlet prior is placed on all the elements of $\bp$ to ensure the probability properties of the $P_k$'s. With a total of five possible read types, A, C, G, T, M, a corrected Perks prior, Dirichlet$\left(\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25}\right)$, is chosen as the prior for the 5-dimensional multinomial parameters. This type of set-up was introduced by P. Walley in his imprecise Dirichlet model paper \citep{Walley1996}.
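For intuition, the conjugate posterior-mean update under this prior can be sketched in a few lines. The read counts below are hypothetical, chosen only to illustrate the behavior of the Dirichlet$(\frac{1}{25},\cdots,\frac{1}{25})$ prior; they are not from our data:

```python
# Hypothetical read counts (A, C, G, T, M) at one genome site -- illustrative only.
counts = [96, 2, 2, 0, 0]          # coverage m_i = 100
PERKS = 1.0 / 25.0                 # corrected Perks prior weight per category

def posterior_mean(y, prior=PERKS):
    """Posterior mean of the multinomial probabilities under a symmetric
    Dirichlet(prior, ..., prior) prior: (y_j + prior) / (m + 5 * prior)."""
    total = sum(y) + len(y) * prior
    return [(yj + prior) / total for yj in y]

post = posterior_mean(counts)
mle = [y / sum(counts) for y in counts]   # maximum likelihood estimate
```

Because the total prior weight is only $5\cdot\frac{1}{25}=0.2$ pseudo-counts, the posterior mean stays close to the maximum likelihood estimate while keeping every category probability strictly positive.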
The corrected Perks prior reduces the prior strength by a factor proportional to the number of categories of the multinomial, so that the Bayesian estimator is preferred to the maximum likelihood estimator for the parameters.[REF!] With the additional assumption that each $P_k$ in $\bp$ is equally likely a priori, we construct the following hierarchical Dirichlet mixture model:
\begin{eqnarray*}
Y_i|c_i,\bp&\stackrel{\textit{indep.}}{\sim}&\textit{Multinomial}\left(m_i;P_{c_i}\right)\\
c_i|\bp&\stackrel{\textit{iid}}{\sim}&\textit{Uniform Discrete}\left(\frac{1}{K}\right)\\
P_k&\stackrel{\textit{iid}}{\sim}&\textit{Dirichlet}\left(\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25}\right)
\end{eqnarray*}
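The generative model above can be simulated forward directly; the following minimal sketch uses hypothetical sizes ($K$, $N$, and per-site coverage are placeholders, not values from our study):

```python
import random

random.seed(7)

K, N, M = 3, 20, 50          # components, genome sites, reads per site (hypothetical)
CATEGORIES = 5               # read types A, C, G, T, M

def rdirichlet(alpha):
    """Sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def rmultinomial(m, probs):
    """Draw counts of m categorical trials with the given probabilities."""
    counts = [0] * len(probs)
    for _ in range(m):
        u, acc = random.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u <= acc:
                counts[j] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts

# P_k ~ iid Dirichlet(1/25,...,1/25); c_i ~ Uniform{1..K}; Y_i ~ Multinomial(m_i; P_{c_i})
P = [rdirichlet([1.0 / 25.0] * CATEGORIES) for _ in range(K)]
c = [random.randrange(K) for _ in range(N)]
Y = [rmultinomial(M, P[c[i]]) for i in range(N)]
```

With the very small concentration $\frac{1}{25}$, each sampled $P_k$ tends to put most of its mass on one or two read types, which matches the intuition that most genome sites are dominated by a single nucleotide.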

Integrating the $P_k$'s out of the joint distribution by conjugacy yields the marginal posterior of the assignment labels,
\begin{eqnarray}
\pi\left(c_1,\cdots,c_N|Y_1,\cdots,Y_N\right)&\propto&\prod_{k=1}^K\frac{\prod_{j=1}^5\Gamma\left(\alpha_k^j\right)}{\Gamma\left(\sum_{j=1}^5\alpha_k^j\right)},\label{eq:postC}
\end{eqnarray}
where $\alpha^j_k=\sum_{i=1}^N y_{i}^j\ind_{\{c_i=k\}}+\frac{1}{25}$, for $j=1,2,3,4,5$ and $k=1,2,\cdots,K.$ In the methodology section we will introduce a sequence of efficient Markov chain Monte Carlo (MCMC) procedures used to cluster the genome sequence positions and generate assignment labels $c_i$'s for each viral genome site. Notice that the posterior distribution (\ref{eq:postC}) is defined for a fixed mixture component number $K$. One may choose $K$ ad hoc; however, if the chosen $K$ is smaller than the true number of mixture components, the true clustering can never be recovered, while if the chosen $K$ is too large, the clustering procedure can become infeasible due to the high-dimensional nature of genome sequences. At every iteration of the MCMC updating step, one coordinate or a class of coordinates is reassigned to one of the $K$ possible labels, and as $K$ increases, the probability of assigning the correct label to each position decreases. Equation (\ref{eq:postC}) naturally places an AIC-like penalty on non-empty clusters: it favors each empty group by scaling the marginal posterior $\pi(c_1,\cdots,c_N|Y_1,\cdots,Y_N)$ by a factor of $\Gamma^5\left(\frac{1}{25}\right)\approx 8\times 10^6$. This shrinkage property allows our algorithm to start with a liberal upper bound on the component number instead of the truth and naturally reduces it to a close upper bound on $K$. In Section 3 we will introduce a tree-like MCMC step that provides the liberal upper bound and a block-MCMC procedure that produces a reasonably close upper bound on $K$.

\subsection{Hellinger Distance}