Authorea

Graham McVicker edited Correcting read depth and GC.tex over 9 years ago

Commit id: faef7cf3769036c61e9f55fae1beb981992165a9

deletions | additions

\subsection{Correcting for GC content and other effects on expected read depth} Since the number of mapped reads can differ between sequencing lanes and runs, we initially modeled model the expected number of counts, $\lambda_{h,i}$, $\lambda_{hi}$, as a linear function of the total number of mapped reads for each individual, $T_i$. However, technical variation between experiments can change this relationship and reduce the power to detect true differences in read depths between samples or cause spurious associations. We directly model some known sources of technical variation and estimate adjusted total read depths, $T^{*}_{i,j}$, for each individual, $i$, individual and target region, $j$. region. This gives us a more accurate estimate of the expected number of reads and boosts our ability to detect true QTLs. \subsubsection{Adjusting total read depth} In RNA-seq experiments, a small number of highly expressed genes can account for a large fraction of mapped reads. Variation in the expression level of these genes can therefore have a large effect on the number of reads that map to all other genes \cite{Wagner_2012}. Furthermore, in In ChIP-seq experiments, the fraction of reads that come from peaks varies between experiments, likely due to differences in the antibody efficacy. To account for this source of variation, we calculate an adjusted total read depth, $T^{*}_{i,j}$ for each region and individual. The adjusted read depth is defined by a quartic function of the total read depth (summed across individuals) for each target region. The coefficients of the quartic function are estimated using a maximum likelihood approach described below. \subsubsection{GC content} GC contentis also associated with affects read depth, with a relationship that varies across samples \cite{Pickrell2010; Benjamini2012}. For example, in some samples, high GC content regions have high read depth, while in other samples they have low read depth. To account for this variation, we add GC content terms to the model of adjusted total read depth, $T^{*}_{ij}$. These terms are modeled with a log linker so that $T^{*}_{ij}$ is guaranteed to be positive. After fitting this model model, as described below, we can calculate an adjusted total read depth for each region that takes into account both the GC content variation and the total read depth variation. \subsubsection{Fitting adjustment coefficients} For each target region, region $j$, we count the total number of reads across individuals, $v_j = \sum_i x_{ij}$, $v_j$ and calculate the GC content, content $w_j$. Then, for each individual $i$, we find maximum likelihood estimates of the fit coefficients $a_{0i}, a_{1i}, \ldots, b_{4i}$ b_{4i} $ that define maximize the adjusted total read depth $T^{*}_{i,j}$, given likelihood of the observedread counts and GC content: given the adjusted expected counts, $T^{*}_{ij}$: \[ \textrm{L}\left(a_{0i}, \textrm{L}\left( D_i \left| a_{0i}, a_{1i}, \ldots, b_{4i}\left| D_i \right. \right ) \right) = \prod_j \Pr_{\mathrm{Pois}} \left(X \left(X_{ij} = x_{ij} \left| T^{*}_{ij} \right. \right)\\ \] \[ T^{*}_{ij} = \exp\left(a_{0i} + a_{1i} w_j + a_{2i} w_j^2 + a_{3i} w_j^3 + a_{4i} w_j^4 \right) \left(b_{1i} v_j + b_{2i} v_j^2 + b_{3i} v_j^3 + b_{4i} v_j^4 \right)