Bryce van de Geijn edited Modeling read depths.tex over 9 years ago

Commit id: 2fbd6c10ea588dedd852eb4ad1a7dcafb677fc6e

\subsection{Modeling the read depths}
The number of reads mapping to a target region is often modeled using a Poisson distribution \cite{Marioni_2008}. However, the Poisson assumption that the variance equals the mean is often violated because read counts from target regions are overdispersed. Part of this overdispersion can be accommodated by modeling the data with a negative binomial distribution with a variance parameter for each test \cite{Anders_2010}. However, the negative binomial distribution assumes that the mean and variance have a quadratic relationship that is consistent across individuals. We have found that sequencing data violate this assumption, owing to technical differences between experiments, and that the violation causes poor calibration of the tests, particularly when sample sizes are small. The CHT therefore models a negative binomial overdispersion parameter, $\Omega_i$, for each individual. It also includes a second overdispersion parameter, $\phi_h$, which models the variance at each site. With this additional dispersion parameter, the data are modeled with a beta-negative-binomial (BNB) distribution. The expected number of counts, $\lambda_{hi}$, is calculated from $\alpha_h$, $\beta_h$, and the genotype, $G_{im}$, for individual $i$ at test SNP $m$. The estimate is scaled by the total number of mapped reads for individual $i$, $T_i$.
\begin{equation} \lambda_{hi} = \left\{