Wen Jenny Shi edited sectionParametric_Ba.tex over 9 years ago

Commit id: cfa2ec518a89099c827d8d09c7fb1ef551510c7d

\subsection{Dirichlet Mixture Model}\label{Sec:model}
To describe the site-specific variation within a viral population, we construct a parametric Bayesian mixture model based on the observed nucleotide read counts. Given the probability parameters, the read counts at each genome site are assumed to follow a multinomial distribution. For the $i^{th}$ position on the sequence, the probabilities of observing each of A, C, G, T, M are denoted by $P_{c_i}=(p_{c_i}^1,p_{c_i}^2,p_{c_i}^3,p_{c_i}^4,p_{c_i}^5)$, where $\sum_{j=1}^5p_{c_i}^j=1$ and every $p_{c_i}^j$ lies between 0 and 1. We assume a finite collection of $K$ possible probability parameters, $\bp=\{P_1,\cdots, P_K\}$, that each genome site can take on, i.e., every $P_{c_i}$ is a member of $\bp$. In other words, the subscript $c_i\in\{ 1,\cdots, K\}$ is an assignment indicator denoting which probability parameter in $\bp$ the $i^{th}$ sequence site is associated with. The number of elements in $\bp$, $K$, is the number of mixture components in the Bayesian mixture framework. Since many sites on the genome sequence share the same tendency to produce certain kinds of reads, it is natural to assume that $K$ is much smaller than $N$, the length of the viral sequence of interest. Furthermore, a weakly informative symmetric Dirichlet prior is placed on every element of $\bp$, so that each $P_k$ is a valid probability vector. With a total of five possible read types, A, C, G, T, M, a corrected Perks prior, Dirichlet($\frac{1}{25}$,$\frac{1}{25}$,$\frac{1}{25}$,$\frac{1}{25}$,$\frac{1}{25}$), is chosen as the prior for the 5-dimensional multinomial parameters. This type of set-up was introduced by P. Walley in his imprecise Dirichlet model paper \citep{Walley1996}.
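
As a brief illustration of how weak this prior is, consider a single site in isolation. This is a sketch only: the count notation $y_i^j$, the example counts, and the treatment of the site on its own are assumptions made for illustration, whereas in the mixture model sites sharing the same $P_k$ would pool their counts. Let $y_i^j$ be the number of reads of type $j$ at site $i$ and $m_i=\sum_{j=1}^5 y_i^j$ the total. By Dirichlet-multinomial conjugacy, the posterior of the site's probability vector is Dirichlet($y_i^1+\frac{1}{25},\cdots,y_i^5+\frac{1}{25}$), with posterior mean
\begin{eqnarray*}
E\left(p^j \mid y_i\right)&=&\frac{y_i^j+\frac{1}{25}}{m_i+\frac{1}{5}},\qquad j=1,\cdots,5,
\end{eqnarray*}
so the total prior strength is $5\times\frac{1}{25}=\frac{1}{5}$ of one pseudo-read. For hypothetical counts $(98,0,2,0,0)$ with $m_i=100$, the posterior mean of the first category is $98.04/100.2\approx 0.9784$ and that of an unobserved category is $0.04/100.2\approx 0.0004$, compared with maximum likelihood estimates of $0.98$ and exactly $0$.
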
The corrected Perks prior reduces the prior strength by a factor proportional to the number of categories of the multinomial, to ensure that the Bayesian estimator is preferred to the maximum likelihood estimator for the parameters \citep{deCampos2011}. With the additional assumption that every $P_k$ in $\bp$ is equally likely to be selected, we construct the following hierarchical Dirichlet mixture model:
\begin{eqnarray*}
Y_i|c_i,\bp&\stackrel{\textit{indep.}}{\sim}&\textit{Multinomial}\left(m_i;P_{c_i}\right)\\
c_i|\bp&\stackrel{\textit{iid}}{\sim}&\textit{Uniform Discrete}\left(\frac{1}{K}\right)\\