\subsection{Modeling the read depths}

The number of reads mapping to a target region is often modelled using a Poisson distribution \cite{XXXXX}. However, the Poisson assumption that the variance is equal to the mean is violated for many of the target regions. Part of this overdispersion can be accommodated by modelling the data with a negative-binomial distribution with a variance parameter, $\eta_h$, for each test \cite{Anders2010}. However, the negative-binomial distribution assumes that the mean and variance have a quadratic relationship that is consistent across individuals. We have found that this assumption is violated by sequencing data and causes poor calibration of the tests, particularly when sample sizes are small. The CHT therefore includes an additional overdispersion parameter for each individual, $\Phi_i$, which is fit across the genome. With this additional dispersion parameter, the data are modelled with a beta-negative-binomial (BNB) distribution.

The expected number of counts, $\lambda_{h,i}$, is calculated from $\alpha_h$, $\beta_h$, and the genotype of individual $i$ at the test SNP, and is scaled by the total number of mapped reads for individual $i$, $T_i$:
\[
\lambda_{h,i} = \left\{
\begin{array}{ll}
2 \alpha_h T_i & \textrm{if } G_i = 0 \textrm{ (homozygous allele 1)} \\
\\
\left( \alpha_h + \beta_h \right) T_i & \textrm{if } G_i = 1 \textrm{ (heterozygous)} \\
\\
2 \beta_h T_i & \textrm{if } G_i = 2 \textrm{ (homozygous allele 2)}
\end{array} \right.
\]

The likelihood of the data is then given by
\[
\Like\left( D_{h,i} \left| \alpha_h, \beta_h, \Phi_i, \eta_h \right. \right) = \Pr_{\mathrm{BNB}} \left( X = x_{h,i} \left| \lambda_{h,i}, \Phi_i, \eta_h \right. \right)
\]
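
As a concrete illustration (not part of the CHT implementation itself), the following Python sketch computes the expected read depth $\lambda_{h,i}$ from the genotype using the piecewise definition above, and evaluates a beta-negative-binomial log-probability in its standard $(r, a, b)$ parameterization. The mapping from $(\lambda_{h,i}, \Phi_i, \eta_h)$ to $(r, a, b)$ depends on the CHT-specific parameterization and is not reproduced here; the function names and example parameter values are illustrative only.

\begin{verbatim}
import numpy as np
from scipy.special import betaln, gammaln

def expected_depth(alpha_h, beta_h, genotype, total_reads):
    """Expected read depth lambda_{h,i} for one individual,
    following the piecewise definition above (genotype coded 0/1/2)."""
    if genotype == 0:          # homozygous allele 1
        return 2.0 * alpha_h * total_reads
    elif genotype == 1:        # heterozygous
        return (alpha_h + beta_h) * total_reads
    elif genotype == 2:        # homozygous allele 2
        return 2.0 * beta_h * total_reads
    raise ValueError("genotype must be 0, 1, or 2")

def bnb_logpmf(x, r, a, b):
    """Log PMF of a beta-negative-binomial in the standard (r, a, b)
    parameterization: a negative binomial whose success probability
    is Beta(a, b) distributed.  Converting (lambda_{h,i}, Phi_i, eta_h)
    to (r, a, b) requires the CHT parameterization, not shown here."""
    return (gammaln(r + x) - gammaln(x + 1) - gammaln(r)
            + betaln(a + r, b + x) - betaln(a, b))

# Example (hypothetical values): expected depth for a heterozygote
# with T_i = 1e7 mapped reads.
lam = expected_depth(alpha_h=1e-6, beta_h=2e-6, genotype=1,
                     total_reads=1e7)
print(lam)  # (1e-6 + 2e-6) * 1e7 = 30.0
\end{verbatim}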