Wen Jenny Shi added Framework.tex  over 9 years ago

\section{Parametric Bayesian Mixture Framework}\label{Sec:method}

\subsection{Sequencing Data}

Advances in high-throughput whole genome shotgun sequencing allow deep genome sequencing of viral populations within a host \citep{Muers2011}. This technology produces millions of short deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequences. These sequences are then aligned to a reference genome, and differences between the reference and the sequenced population are noted. With this shotgun sequencing method, we are able to pool the reads from each individual and work with data of the following form:
\begin{table}
%\centering
\resizebox{16cm}{!}{
\begin{tabular*}{0.5\textwidth}{l*{9}{c}}
\hline
$Y_1$&$Y_2$&$Y_3$&$Y_4$&$Y_5$&$Y_6$&$Y_7$&$Y_8$&$\cdots$\\
\hline
 & & & & & C& A& T& $\cdots$\\
C& T& C& T& A& C& A& & $\cdots$\\
 & & C& T& A& C& C& M& $\cdots$\\
 & & & & G& C& T& T& $\cdots$\\
 & C& M& G& T& C& T& & $\cdots$\\
 & G& C& T& C& & & & $\cdots$\\
\hline
\end{tabular*}
\hspace{.2in} \quad $\Longrightarrow$ \hspace{.3in}
$\begin{array}{cccccccccc}
\hline
 &Y_1&Y_2&Y_3&Y_4&Y_5&Y_6&Y_7&Y_8&\cdots\\
\hline
A& 0& 0& 0& 0& 2& 0& 2& 0&\cdots \\
C& 1& 1& 3& 0& 1& 5& 1& 0&\cdots\\
G& 0& 1& 0& 1& 1& 0& 0& 0&\cdots\\
T& 0& 1& 0& 3& 1& 0& 2& 2&\cdots\\
M& 0& 0& 1& 0& 0& 0& 0& 1&\cdots\\
\hline\\
\end{array}$
}
\vspace{.2in}
\caption{High-throughput sequencing data from all samples are pooled and aligned (left panel) and then compressed into a five-row count matrix for the genome of interest (right panel).}
\label{tab:seqdata}
\end{table}
The letters A, C, G, T, and M stand for adenine, cytosine, guanine, thymine, and missing/deleted data, respectively. The left panel of Table \ref{tab:seqdata} is a toy example of a high-throughput sequencing alignment; its compressed, count-based view is shown in the right panel.
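The compression illustrated in Table \ref{tab:seqdata} is a simple column-wise tally. The following Python sketch (the function name and the representation of uncovered positions as blanks are our own choices, not part of the data format) reproduces the toy example:

```python
# Toy alignment from Table 1: one string per read, aligned to the reference.
# A space means the read does not cover that position; 'M' is missing/deleted.
READS = [
    "     CAT",
    "CTCTACA ",
    "  CTACCM",
    "    GCTT",
    " CMGTCT ",
    " GCTC   ",
]
BASES = "ACGTM"

def count_matrix(reads):
    """Collapse aligned reads into a five-row count matrix (A, C, G, T, M)."""
    n = max(len(r) for r in reads)
    counts = {b: [0] * n for b in BASES}
    for read in reads:
        for i, ch in enumerate(read):
            if ch in counts:        # skip positions the read does not cover
                counts[ch][i] += 1
    return counts

Y = count_matrix(READS)
```

Each column of the resulting matrix is the vector $Y_i$ of per-type read counts used throughout the model.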
Counts of each read type (A, C, G, T, M) at the $i^{th}$ position are recorded as $Y_i=(y_i^1,y_i^2,y_i^3,y_i^4,y_i^5)$.

\subsection{Dirichlet Mixture Model}\label{Sec:model}

To describe the site-specific variation within a viral population, we construct a parametric Bayesian mixture model based on the observed nucleotide read counts. Given the probability parameters, the collection of read counts at each genome site is assumed to follow a multinomial distribution. For an arbitrary $i^{th}$ position on the sequence, the probabilities of observing each of A, C, G, T, M are denoted $P_{c_i}=(p_{c_i}^1,p_{c_i}^2,p_{c_i}^3,p_{c_i}^4,p_{c_i}^5)$, with the assumption that $\sum_{j=1}^5p_{c_i}^j=1$ and each $p_{c_i}^j$ lies between 0 and 1. We assume a finite collection of $K$ possible probability parameters, $\bp=\{P_1,\cdots, P_K\}$, that each genome site can take on, i.e. every $P_{c_i}$ is a member of $\bp$. In other words, the subscript $c_i\in\{1,\cdots, K\}$ is an assignment indicator denoting which probability parameter in the set $\bp$ the $i^{th}$ sequence site is associated with. The number of elements in $\bp$, $K$, is the number of mixture components in the Bayesian mixture framework. Since many sites on the genome share the same tendencies toward certain read types, it is natural to assume that $K$ is much smaller than the length of the viral sequence of interest, $N$. Furthermore, a weakly informative symmetric Dirichlet prior is placed on all the elements of $\bp$ to ensure that each $P_k$ is a valid probability vector. With five possible read types, A, C, G, T, M, a corrected Perks prior, Dirichlet$\left(\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25}\right)$, is chosen as the prior for the 5-dimensional multinomial parameters. This type of set-up was introduced by Peter Walley in his imprecise Dirichlet model paper \citep{Walley1996}.
The corrected Perks prior reduces the prior strength by a factor proportional to the number of categories of the multinomial, which ensures that the Bayesian estimator is preferred to the maximum likelihood estimator for the parameters \citep{deCampos2011}. Under the additional assumption that each $P_k$ in $\bp$ is a priori equally likely, we construct the following hierarchical Dirichlet mixture model:
\begin{eqnarray*}
Y_i|c_i,\bp&\stackrel{\textit{indep.}}{\sim}&\textit{Multinomial}\left(m_i;P_{c_i}\right)\\
c_i|\bp&\stackrel{\textit{iid}}{\sim}&\textit{Uniform Discrete}\left(\frac{1}{K}\right)\\
P_k&\stackrel{\textit{iid}}{\sim}&\textit{Dirichlet}\left( \frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25},\frac{1}{25}\right)
\end{eqnarray*}
where $m_i$ denotes the total number of reads observed at the $i^{th}$ position, i.e. $\sum_{j=1}^5y_i^j=m_i$, and the component number $K$ is a fixed but unknown integer. Integrating the posterior density $\pi(c_1,\cdots, c_N,\bp|Y_1,\cdots,Y_N)$ over $\bp$, the marginal posterior for the assignments given the reads on the sequence is
\begin{equation}
\pi(c_1,\cdots, c_N|Y_1,\cdots,Y_N)=\frac{1}{h(Y_1,\cdots,Y_N)}\prod_{k=1}^K\frac{\prod_{j=1}^5\Gamma\left(\sum_{i=1}^Ny_{i}^j\ind+\frac{1}{25} \right)}{\Gamma\left(\sum_{i=1}^Nm_i\ind+1 \right)},
\label{eq:postC}
\end{equation}
where $h(Y_1,\cdots,Y_N)$ is the normalizing constant.
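For intuition, the product over clusters in (\ref{eq:postC}) is straightforward to evaluate on the log scale. The Python sketch below (function and variable names are ours, not part of the model) accumulates, for each cluster, the per-type log-gamma terms over the positions assigned to it, following the displayed formula term by term:

```python
import math

A = 1.0 / 25.0   # corrected-Perks prior weight for each of A, C, G, T, M

def log_marginal_posterior(counts, assignments, K):
    """Unnormalized log pi(c_1,...,c_N | Y_1,...,Y_N): for each cluster k,
    sum_j lgamma(sum_i y_i^j 1(c_i=k) + 1/25) minus lgamma(sum_i m_i 1(c_i=k) + 1).

    counts      : list of length-5 tuples (A, C, G, T, M) per position
    assignments : list of cluster labels in {0, ..., K-1}
    """
    logp = 0.0
    for k in range(K):
        alpha = [A] * 5          # prior-augmented per-type totals for cluster k
        m_k = 0                  # total reads assigned to cluster k
        for y, c in zip(counts, assignments):
            if c == k:
                for j in range(5):
                    alpha[j] += y[j]
                m_k += sum(y)
        logp += sum(math.lgamma(a) for a in alpha) - math.lgamma(m_k + 1.0)
    return logp
```

Since the normalizing constant $h(Y_1,\cdots,Y_N)$ cancels in MCMC acceptance ratios, this unnormalized form suffices in practice. Note that each empty cluster contributes exactly $5\ln\Gamma\left(\frac{1}{25}\right)\approx 16$ to the log posterior.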
Furthermore, if both read counts and assignments are given for the entire sequence, we have
\begin{equation}
P_k|c_1,\cdots, c_N,Y_1,\cdots,Y_N\stackrel{\textit{indep.}}{\sim}\textit{Dirichlet}\left(\alpha^1_k,\alpha^2_k,\alpha^3_k,\alpha^4_k,\alpha^5_k \right),
\label{eq:postP}
\end{equation}
where $\alpha^j_k=\sum_{i=1}^Ny_{i}^j\ind+\frac{1}{25}$, for $j=1, 2, 3, 4, 5$ and $k=1,2,\cdots, K$. In the methodology section we will introduce a sequence of efficient Markov chain Monte Carlo (MCMC) procedures used to cluster the genome sequence positions and generate an assignment label $c_i$ for each viral genome site. Notice that the posterior distribution (\ref{eq:postC}) is defined for a fixed mixture component number $K$. One may choose $K$ ad hoc; however, if the chosen $K$ is smaller than the true number of mixture components, the true clustering can never be achieved, and if the chosen $K$ is too large, the clustering procedure can become infeasible due to the high-dimensional nature of genome sequences. At every iteration of the MCMC updating step, one coordinate or a class of coordinates is reassigned to one of the $K$ possible labels, and as $K$ increases, the probability of assigning the correct label to each position decreases. Equation (\ref{eq:postC}) naturally places an AIC-like penalty on non-empty clusters: each empty group scales the marginal posterior $\pi(c_1,\cdots, c_N|Y_1,\cdots,Y_N)$ up by a factor of $\Gamma^5\left(\frac{1}{25}\right)\approx 8\times 10^6$. This shrinkage property allows our algorithm to start from a liberal upper bound on the number of components, rather than the truth, and naturally reduce it toward a tight upper bound on $K$. In Section 3 we will introduce a tree-like MCMC step that provides the liberal upper bound and a block-MCMC procedure that produces a reasonably close upper bound on $K$.
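Given the assignments, drawing the $P_k$'s from (\ref{eq:postP}) reduces to the standard construction of a Dirichlet variate as normalized independent Gamma draws. A minimal sketch using only the Python standard library (helper names are ours):

```python
import random

A = 1.0 / 25.0   # corrected-Perks prior weight per read type

def posterior_alpha(counts, assignments, k):
    """alpha_k^j = sum_i y_i^j 1(c_i = k) + 1/25, as in the conditional posterior."""
    alpha = [A] * 5
    for y, c in zip(counts, assignments):
        if c == k:
            for j in range(5):
                alpha[j] += y[j]
    return alpha

def sample_dirichlet(alpha, rng=random):
    """Draw one Dirichlet(alpha) variate by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]
```

This conditional draw is the natural Gibbs companion to an assignment update based on the marginal posterior over the $c_i$'s.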
\subsection{Hellinger Distance}

Before diving into our five-step sequential clustering approach, let us recall an $f$-divergence, the Hellinger distance, which we use to draw inferences. The Hellinger distance, $H$, first introduced by Ernst Hellinger \citep{Hellinger1909}, is used in probability and statistics to quantify the similarity between two probability distributions. Under Lebesgue measure, for two probability density functions $f$ and $g$, the squared Hellinger distance can be expressed as follows:
\begin{equation}
H^2(f,g)=\frac{1}{2}\int\left(\sqrt{f(x)}-\sqrt{g(x)} \right)^2dx=1-\int\sqrt{f(x)g(x)}dx.
\label{eq:H2}
\end{equation}
We prefer the Hellinger distance over the Kullback-Leibler (KL) divergence because it is a metric and its range is $[0,1]$; one could also use a symmetrized KL divergence, such as the Jensen-Shannon divergence, instead. We will use the Hellinger distance to compare two marginal posterior distributions of the probability parameters given all cluster assignments and read counts, $P_{k_1}|c_1,\cdots, c_N,Y_1,\cdots,Y_N$ and $P_{k_2}|c_1,\cdots, c_N,Y_1,\cdots,Y_N$. The realized distance measures how similar two allelic positions are, or how similar the same allelic position is at two different time points. Applying the squared measure (\ref{eq:H2}) to two marginal posteriors of the form (\ref{eq:postP}), we have
\begin{equation}
H^2(P_{k_1},P_{k_2}|c_1,\cdots, c_N,Y_1,\cdots,Y_N)=1-\frac{B(\vec{\beta}_{k_1,k_2})}{\sqrt{B(\vec{\alpha}_{k_1})B(\vec{\alpha}_{k_2})}},
\label{eq:H2P}
\end{equation}
where
\begin{eqnarray*}
&&\vec{\alpha}_{i}=\left(\alpha^1_{i},\alpha^2_{i},\alpha^3_{i},\alpha^4_{i},\alpha^5_{i} \right), \textit{ for }i=k_1,k_2;\\
&&\vec{\beta}_{k_1,k_2}=\left(\frac{\alpha^1_{k_1}+\alpha^1_{k_2}}{2},\frac{\alpha^2_{k_1}+\alpha^2_{k_2}}{2}, \frac{\alpha^3_{k_1}+\alpha^3_{k_2}}{2},\frac{\alpha^4_{k_1}+\alpha^4_{k_2}}{2},\frac{\alpha^5_{k_1}+\alpha^5_{k_2}}{2}\right);\\
&&B(a^1,a^2,a^3,a^4,a^5)=\frac{\prod_{j=1}^5\Gamma(a^j)}{\Gamma\left(\sum_{j=1}^5a^j\right)}.
\end{eqnarray*}
To better visualize Hellinger distances for the viral data we apply a monotonic transformation to $H$: $f(H)=\ln\left(1-\ln\left(1-H^2 \right)\right)$. With the definition (\ref{eq:H2P}), the Hellinger distance can then be transformed into
\begin{equation}
H_t(P_{k_1},P_{k_2}|c_1,\cdots, c_N,Y_1,\cdots,Y_N)=\ln\left(1-\ln\left(\frac{B(\vec{\beta}_{k_1,k_2})}{\sqrt{B(\vec{\alpha}_{k_1})B(\vec{\alpha}_{k_2})}} \right)\right).
\label{eq:Ht}
\end{equation}
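Both (\ref{eq:H2P}) and (\ref{eq:Ht}) reduce to log-Beta evaluations, which can be computed stably with log-gamma arithmetic. A minimal Python sketch (function names are ours), where the Dirichlet parameter vectors $\vec{\alpha}_{k_1}$ and $\vec{\alpha}_{k_2}$ are passed as length-5 lists:

```python
import math

def log_beta(alpha):
    """log B(a^1,...,a^5) = sum_j lgamma(a^j) - lgamma(sum_j a^j)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def hellinger_sq(alpha1, alpha2):
    """Squared Hellinger distance between Dirichlet(alpha1) and Dirichlet(alpha2):
    1 - B(beta) / sqrt(B(alpha1) B(alpha2)), with beta the element-wise average."""
    beta = [(a + b) / 2.0 for a, b in zip(alpha1, alpha2)]
    log_ratio = log_beta(beta) - 0.5 * (log_beta(alpha1) + log_beta(alpha2))
    return 1.0 - math.exp(log_ratio)

def hellinger_transformed(alpha1, alpha2):
    """Visualization transform H_t = ln(1 - ln(B(beta)/sqrt(B(alpha1)B(alpha2))))."""
    beta = [(a + b) / 2.0 for a, b in zip(alpha1, alpha2)]
    log_ratio = log_beta(beta) - 0.5 * (log_beta(alpha1) + log_beta(alpha2))
    return math.log(1.0 - log_ratio)
```

Working with `log_ratio` directly avoids underflow when the two posteriors are nearly disjoint, in which case the Beta-function ratio is vanishingly small while its logarithm remains well scaled.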