Graham McVicker edited Modeling read depths.tex  almost 10 years ago

Commit id: 163e41be92958453f0cf747df4714772a594eebb


\subsection{Modeling the read depths}
The number of reads mapping to a target region is often modelled using a Poisson distribution \cite{Marioni_2008}. However, the Poisson assumption that the variance equals the mean is violated because read counts from target regions are overdispersed. Part of this overdispersion can be accommodated by modelling the data with a negative-binomial distribution with a variance parameter, $\eta_h$, for each test, $h$ \cite{Anders_2010}. However, the negative-binomial distribution assumes that the mean and variance have a quadratic relationship that is consistent across individuals. We have found that this assumption is violated by sequencing data and causes poor calibration of the tests, particularly when sample sizes are small. The CHT therefore includes an additional overdispersion parameter for each individual, $\Phi_i$, which is fit across the genome. With this additional dispersion parameter, the data are modelled with a beta-negative-binomial (BNB) distribution.
The expected number of counts, $\lambda_{hi}$, is calculated from $\alpha_h$, $\beta_h$, and the genotype, $G_{im}$, for individual $i$ at test SNP $m$. The estimate is scaled by the total number of mapped reads for individual $i$, $T_i$.
\[
\lambda_{hi} =
\left\{ \begin{array}{ll}
2 \alpha_h T_i & \textrm{if } G_{im} = 0 \textrm{ (homozygous allele 1)} \\ \\
\left( \alpha_h + \beta_h \right) T_i & \textrm{if } G_{im} = 1 \textrm{ (heterozygous)} \\ \\
2 \beta_h T_i & \textrm{if } G_{im} = 2 \textrm{ (homozygous allele 2)}
\end{array} \right.
\]
The likelihood of the data is then given by the equation
\[
\textrm{L}\left( D_h \left| \alpha_h, \beta_h, \Phi_\bullet, \eta_j \right. \right) = \prod_i \Pr_{\mathrm{BNB}} \left( X = x_{ij} \left| \lambda_{hi}, \Phi_i, \eta_j \right. \right)
\]
where $x_{ij}$ is the number of reads for individual $i$ in target region $j$. We detail estimation of the genome-wide dispersion parameter for each individual, $\Phi_i$, below.
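As a worked illustration of the expected-count equation above, consider purely hypothetical values that are not taken from any data set: $\alpha_h = 2 \times 10^{-6}$, $\beta_h = 3 \times 10^{-6}$, and $T_i = 10^{7}$ mapped reads. Under these assumed values the three genotype classes give
\[
\begin{array}{ll}
\lambda_{hi} = 2 \alpha_h T_i = 40 & \textrm{if } G_{im} = 0 \\
\lambda_{hi} = \left( \alpha_h + \beta_h \right) T_i = 50 & \textrm{if } G_{im} = 1 \\
\lambda_{hi} = 2 \beta_h T_i = 60 & \textrm{if } G_{im} = 2
\end{array}
\]
so the expected read depth is additive in the genotype: the heterozygote mean is the average of the two homozygote means. Because the likelihood is a product over individuals, it can equivalently be evaluated as a sum of log-probabilities,
\[
\log \textrm{L}\left( D_h \left| \alpha_h, \beta_h, \Phi_\bullet, \eta_j \right. \right) = \sum_i \log \Pr_{\mathrm{BNB}} \left( X = x_{ij} \left| \lambda_{hi}, \Phi_i, \eta_j \right. \right)
\]
which is typically the more convenient form for numerical work when the number of individuals is large.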