\documentclass{article}
%\usepackage{scicite}
\bibliographystyle{naturemag}
\usepackage{times}
\usepackage{graphicx}
\usepackage{url}
\usepackage{amsmath}
\usepackage{lscape}
\DeclareMathOperator{\Like}{L}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\expit}{expit}
\renewcommand\thefigure{S\arabic{figure}}
\renewcommand\thetable{S\arabic{table}}
\topmargin 0.0cm
\oddsidemargin 0.2cm
\textwidth 16cm
\textheight 21cm
\footskip 1.0cm

\title{Supplemental Material for WASP: allele-specific software for unbiased discovery of molecular quantitative trait loci}
\author{Bryce van de Geijn$^{1,2,\dagger}$, Graham McVicker$^{3,\dagger}$,
Yoav Gilad$^1$, Jonathan K. Pritchard$^{3,4,5,\ast}$\\ \\
\normalsize{$^{1}$Department of Human Genetics, University of Chicago}\\
\normalsize{$^{2}$Committee on Genetics, Genomics and Systems Biology, University of Chicago}\\
\normalsize{$^{3}$Department of Genetics, Stanford University}\\
\normalsize{$^{4}$Department of Biology, Stanford University}\\
\normalsize{$^{5}$Howard Hughes Medical Institute, Stanford University}\\
\normalsize{$^\dagger$These authors contributed equally.}\\
\normalsize{$^\ast$To whom correspondence should be addressed: [email protected]}\\
}
\date{}

\begin{document}
\baselineskip 24pt
\maketitle
\clearpage
\tableofcontents
\clearpage

% Supplemental Figures
%-----------------
% S1 - Mapping workflow
% S2 - Version of Figure 1C with Bowtie instead of BWA
% S3 - Biased discovery rate: true FDR as a function
%      of effect size (reference proportion). Can also have
%      panels with different read depth, different \pi_0
% S4 - Combined test workflow
% S5 - Fitted values for the splines?
% S6 -
% S7 -
% S8 -

% Supplemental Tables
%----------------------------
% 1 - Table of variables used in mathematical equations
% 2 -

\section{Data sources}

\subsection{RNA-seq data}
% we used RNA-seq data from Pickrell2010, processed it in this way....

\subsection{ChIP-seq data}
% we used ChIP-seq data that was previously collected by our lab\cite{McVicker2013}...

\subsection{eQTL list}
% we downloaded from Geuvadis....

\subsection{Genotypes}
% 1000G + IMPUTE2
% need to change wording
% We imputed genotypes and phased our samples with IMPUTE2\cite{Howie2009}
% using the 1000 Genomes Phase 1 integrated version 3 reference
% panel\cite{1000GenomesProjectConsortium2012}. To speed up computation, we used
% pre-phasing information\cite{Howie2012} from HapMap Phase II genotypes (release
% 22)\cite{InternationalHapMapConsortium2007}. We used the IMPUTE2 option
% \verb|-filt_rules_l 'afr.maf<0.004'| to remove sites that are monomorphic or singletons
% in the 246 AFR individuals in the 1000 Genomes panel and the \verb|-Ne 20000|
% option to specify an effective population size of 20,000. Since the 1000 Genomes
% reference panel is on the hg19 assembly, we used liftover\cite{Hinrichs2006} to
% transfer HapMap genotype and phase information from hg18 to hg19. We removed
% SNPs with strands, chromosomes or ordering that differed between hg18 and hg19.
% TODO: how were data lifted over? How were panels merged?

\section{Unbiased read mapping with WASP}

\subsection{How WASP mapping works}

\begin{figure}%[htbp]
\centering
% TODO....
%\includegraphics[width=0.8\textwidth]{mapping_workflow}
%\vspace{-3cm}
\caption{\textbf{Diagram of the workflow for the WASP read mapping tool.}\label{fig:CHT_workflow}}
\end{figure}

\subsection{Testing against N-masked and personal genomes}

\section{Discovery of quantitative trait loci with WASP}

To discover molecular quantitative trait loci (QTLs), WASP uses a statistical test, which we call the Combined Haplotype Test (CHT).
As input, the CHT takes genotype probabilities at known SNPs as well as mapped reads from next-generation sequencing experiments such as ChIP-seq or RNA-seq. The CHT combines two types of information from next-generation sequencing data: the depth of mapped reads and the allelic imbalance of mapped reads that overlap heterozygous sites. The CHT models the overdispersion of read counts (both across regions and across individuals) and accounts for variability introduced by GC content and the fraction of reads that fall within peaks.

We used an early version of the CHT to discover histone mark QTLs as described in the supplementary material of our paper\cite{McVicker2013}. We have since extended the statistical model and made numerous improvements to the software implementation. These improvements include:
\begin{itemize}
\item Dispersion parameters are now estimated automatically from the data (rather than chosen by the user)
\item Unknown covariates can be accommodated by the model by providing principal component loadings
\item Tested regions can be split across multiple genomic regions, such as exons
\item Implementation efficiency has been improved so that the model can now be run with hundreds of individuals
\end{itemize}

\begin{table}[htbp]
\caption{Description of mathematical variables used in the combined haplotype test}
\begin{center}
\small
\begin{tabular}{ l l }
% \hline \hline
% Variable & Description\\
\hline
\multicolumn{2}{l}{\textbf{Index variables}}\\
$h$ & test number (one per test SNP / target region pair) \\
$i$ & individual\\
$j$ & target region\\
$k$ & SNP within target region\\
$m$ & test SNP\\
\multicolumn{2}{l}{\textbf{Latent variables}}\\
$\alpha_h$ & molecular phenotype level of the reference allele for test $h$\\
$\beta_h$ & molecular phenotype level of the alternative allele for test $h$\\
$p_h$ & fraction of allele-specific reads expected from the reference allele ($p_h = \frac{\alpha_h}{\alpha_h + \beta_h}$)\\
$T^{*}_{i,j}$ & genotype-independent expected total read count for individual $i$, target region $j$\\
$\lambda_{hi}$ & expected total read count for test $h$, individual $i$\\
$\Phi_i$ & overdispersion of read counts for individual $i$ (across all tests)\\
$\eta_j$ & overdispersion of read counts for target region $j$ (across all individuals)\\
$\Upsilon_i$ & overdispersion of allele-specific reads for individual $i$\\
\multicolumn{2}{l}{\textbf{Observed variables}}\\
$x_{ij}$ & number of reads for individual $i$, target region $j$\\
$G_{im}$ & genotype call for individual $i$, test SNP $m$\\
$T_{i}$ & total number of genome-wide mapped reads for individual $i$\\
$n_{ik}$ & total number of allele-specific reads for individual $i$, target SNP $k$\\
$y_{ik}$ & number of allele-specific reads from the reference haplotype for individual $i$, target SNP $k$\\
$H_{ik}$ & probability that individual $i$ is heterozygous for target SNP $k$\\
\hline \hline
\end{tabular}
\end{center}
\label{tab:readcounts}
\end{table}%

\subsection{The combined haplotype test}

The combined haplotype test (CHT) tests whether the genotype of a ``test SNP'', $m$, is associated with read depth and allelic imbalance within a nearby ``target region'', $j$, on the same chromosome. A single target region may be discontiguous and span multiple genomic loci. For example, the exons of a gene can be used as a single target region, which is useful when searching for expression QTLs using RNA-seq reads. The test SNP is not required to be within the target region, but is assumed to be nearby and cis-acting. This allows us to combine information from across phased heterozygous SNPs and assign reads to one haplotype or the other.

\subsection{Basic model}

The CHT is a likelihood ratio test which models the data with two components: read depth and allelic imbalance.
These are parameterized by $\alpha_h$ and $\beta_h$, which define the expected read depth for a chromosome carrying the reference and alternative allele, respectively. Since an additive, cis-acting variant is assumed, the expected allelic imbalance in heterozygotes, $p_h$, can be calculated as $p_h = \frac{\alpha_h}{\alpha_h + \beta_h}$.

\subsection{Modeling the read depths}

The number of reads mapping to a target region is often modelled using a Poisson distribution \cite{XXXXX}. However, the Poisson assumption that the variance is equal to the mean is violated for many of the target regions. Part of this overdispersion can be accommodated by modelling the data with a negative-binomial distribution with a variance parameter, $\eta_h$, for each test\cite{Anders2010}. However, the negative-binomial distribution assumes that the mean and variance have a quadratic relationship that is consistent across individuals. We have found that this assumption is violated by sequencing data and causes poor calibration of the tests, particularly when sample sizes are small. The CHT therefore includes an additional overdispersion parameter for each individual, $\Phi_i$, which is fit across the genome. After adding this additional dispersion parameter, the data are modelled with a beta-negative-binomial (BNB) distribution.

\subsection{Correcting for GC content and other effects on expected read depth}

Since the number of mapped reads can differ between sequencing lanes and runs, we initially model the expected number of counts, $\lambda_{h,i}$, as a linear function of the total number of mapped reads for each individual, $T_i$. However, technical variation between experiments can change this relationship and reduce the power to detect true differences in read depths between samples or cause spurious associations.
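% The following illustrative snippet is a minimal sketch of a beta-negative-binomial
% log-likelihood; it is not the WASP implementation (which also conditions the mean
% on the adjusted read depths and fits $\Phi_i$ genome-wide), and the parameter
% values and counts are invented.
As a concrete illustration of the beta-negative-binomial read-depth model, the BNB log PMF can be computed directly from gamma and beta functions. This is a minimal sketch, not the WASP code: the parameterization (NegBin with its success probability integrated over a Beta prior) is one standard form of the BNB, and the parameter values and toy counts below are invented.

```python
import numpy as np
from scipy.special import betaln, gammaln

def bnb_logpmf(k, r, a, b):
    """Log PMF of a beta-negative-binomial: X ~ NegBin(r, p) with the
    success probability p integrated over a Beta(a, b) prior, giving
    P(X=k) = C(r+k-1, k) * B(a+r, b+k) / B(a, b)."""
    k = np.asarray(k, dtype=float)
    return (gammaln(r + k) - gammaln(k + 1.0) - gammaln(r)
            + betaln(a + r, b + k) - betaln(a, b))

# Toy read counts for one target region across five individuals
# (hypothetical data; in the CHT the mean would depend on genotype).
counts = np.array([12, 30, 25, 8, 40])
loglik = bnb_logpmf(counts, r=5.0, a=10.0, b=3.0).sum()
```

The extra Beta layer is what allows the variance to exceed the quadratic negative-binomial mean--variance relationship.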
We directly model some known sources of technical variation and estimate adjusted total read depths, $T^{*}_{i,j}$, for each individual and target region. This gives us a more accurate estimate of the expected number of reads and boosts our ability to detect true QTLs.

\subsubsection{Adjusting total read depth}

In RNA-seq experiments, a small number of highly expressed genes can account for a large fraction of mapped reads \cite{XXXXX}. Variation in the expression level of these genes can therefore have a large effect on the number of reads that map to all other genes. In ChIP-seq experiments, the fraction of reads that come from peaks varies between experiments, likely due to differences in antibody efficacy. We therefore calculate $T^{*}_{i,j}$ as a quartic function of the total read depth for each target region summed across individuals. The coefficients of the quartic function are unique to each individual and are fit using a maximum likelihood approach described below.

\subsubsection{GC content}

GC content affects read depth, but with a relationship that varies across samples \cite{Pickrell2010, Benjamini2012}. For example, in some samples, high GC content regions have high read depth, while in other samples they have low read depth. To account for this variation, we model the effect of GC content on read depth for each individual across all target regions. After fitting this model, we calculate an adjusted total read depth for each region, $T^{*}_{ij}$, that takes the GC content variation into account.

\subsubsection{Fitting adjustment coefficients}

We count the total number of reads and calculate the GC content of each target region. Then, for each individual, we fit coefficients that maximize the likelihood of the observed counts given the adjusted expected counts:
\[
\Like\left( D_i \,\middle|\, a0_i, a1_i, \ldots, b4_i \right) = \prod_j \Pr_{\mathrm{Pois}} \left( X_{ij} = x_{ij} \,\middle|\, T^{*}_{ij} \right),
\]
where
\[
T^{*}_{ij} = \exp\left( a0_i + a1_i \, gc_j + a2_i \, gc_j^2 + a3_i \, gc_j^3 + a4_i \, gc_j^4 \right) \left( b1_i \, tot_j + b2_i \, tot_j^2 + b3_i \, tot_j^3 + b4_i \, tot_j^4 \right),
\]
$gc_j$ is the GC content of target region $j$, and $tot_j$ is the total read depth of target region $j$ summed across individuals.

\subsection{Modeling the allelic imbalances}

Reads that can be assigned to one haplotype or the other can be modelled using the binomial distribution \cite{XXXX}.
% Model allelic imbalance using a binomial distribution
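% The following illustrative snippet is a simplified sketch of the binomial
% component only; it ignores the allelic overdispersion $\Upsilon_i$, the
% heterozygote probabilities $H_{ik}$, and the read-depth component, and the
% counts below are invented.
As a minimal illustration of this binomial component, the reference-haplotype proportion $p$ can be estimated by maximum likelihood from allele-specific counts pooled across phased heterozygous SNPs and compared against the balanced null $p = 0.5$ with a likelihood-ratio statistic. This sketch ignores the overdispersion term $\Upsilon_i$ and genotype uncertainty, and the data are invented:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# y[k] reference-haplotype reads out of n[k] allele-specific reads at
# phased heterozygous target SNP k (toy data for one individual/region).
y = np.array([7, 12, 5, 9])
n = np.array([10, 20, 8, 15])

def neg_loglik(p):
    """Negative binomial log-likelihood of the reference-read counts."""
    return -binom.logpmf(y, n, p).sum()

# Alternative model: p free in (0, 1); null model: balanced reads, p = 0.5.
fit = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
lrt = 2.0 * (neg_loglik(0.5) - fit.fun)  # ~ chi^2 with 1 d.o.f. under the null
```

In the full CHT, this allelic-imbalance likelihood is multiplied by the beta-negative-binomial read-depth likelihood and the binomial is replaced by a beta-binomial to absorb overdispersion.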