
Charith Bhagya Karunarathna edited untitled.tex over 7 years ago

Commit id: 7091bf2c40318bed0aa0d36dcf46add3180bab5b


\section{Outline}
\begin{enumerate}
\item Introduction
\begin{enumerate}
\item Brief literature review
\begin{itemize}
\item Most genetic association studies focus on common variants.
%( which are effective for common disease caused by common variants).
\item However, rare genetic variants can play major roles in influencing complex traits \cite{Pritchard_2001,Schork_2009}.

\item Tree-based methods.
\begin{itemize}
\item A local genealogical tree represents the ancestry of the sample of haplotypes at each locus in the genomic region being fine-mapped.
\item Haplotypes carrying the same disease risk alleles are expected to be related and to cluster on the genealogical tree at a disease risk locus.
\item Tree-based methods assess whether trait values co-cluster with the ancestral tree for the haplotypes (e.g., \cite{Bardel_2005}).
\item Mailund et al. (2006) developed a method that constructs local ancestral trees and scores the genealogies according to the clustering of cases and controls.
\end{itemize}

\item Unlike Burkett et al., who focus on {\em detection} of disease risk variants, we focus on {\em localization} of the association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model.
\end{itemize}
\item Purpose of the study
\begin{itemize}
\item To compare the performance of selected rare-variant association methods for fine-mapping a disease locus. In our investigation, we focus on the localization of the association signal within a 2-Mbp candidate genomic region.
\item We use variant data simulated from coalescent trees. Our work on localization of the association signal extends that of Burkett et al., who investigated the ability to detect association signal in the candidate region, without regard to localization.
\item To illustrate ideas, we start by working through a particular example dataset as a case study for insight.
\item Next, we perform a simulation study involving 200 sequencing datasets and score which association method localizes best overall.
\end{itemize}
\item Benchmarks with true trees.
\begin{itemize}
\item A gene genealogy describes the relationships among individual genes sampled from the population.
\item In this report, we consider local genealogical trees, which represent the ancestry of the sample at a given locus in the genomic region being fine-mapped.
\item Haplotypes carrying the same disease risk alleles are expected to cluster on a local tree at the disease risk locus.
\item In practice, true trees are unknown. However, cluster statistics based on true trees represent a best case for detecting association, as tree uncertainty is eliminated.
\item Following Burkett et al., we use clustering tests based on true trees as benchmarks against which to compare the popular association methods.
\end{itemize}
\end{enumerate}
\item Methods
\begin{enumerate}
\item Data simulation
%(CAN YOU PLEASE FLESH OUT ALL THE SUBSECTIONS IN THIS SECTION WITH MORE SUBPOINTS)
\begin{enumerate}
\item Simulating the population
\begin{itemize}
\item fastsimcoal2 \cite{Excoffier_2013}
\begin{itemize}
\item Simulate 3000 haplotypes of 4000 equispaced SNVs in a 2-Mbp region.
\item Recombination rate $= 1 \times 10^{-8}$ per bp per generation.
\item Effective population size, $N_{e} = 6200$ \cite{Tenesa_2007}.
\item Randomly pair the 3000 haplotypes into 1500 diploid individuals.
\end{itemize}
\item Logistic regression model of disease status.
\begin{itemize}
\item Assign disease status to the 1500 individuals based on randomly sampled risk SNVs from the middle region (950\,kbp to 1050\,kbp) and a diploid model of disease risk.
\item For risk SNVs, the number of copies of the derived allele increases disease risk according to a logistic regression model,
$$ \mathrm{logit}\{P(D=1 \mid G)\} = \mathrm{logit}(0.2) + \sum_{j=1}^{m} 2 \times G_j, \;\; \mbox{where} $$
\item $D$ is disease status ($D = 1$, case; $D=0$, control).
\item $G=(G_1, G_2, \ldots , G_{m})$ is an individual's multi-locus genotype at $m$ risk SNVs, with $G_j$ being the number of copies of the derived allele at the $j^{th}$ risk SNV.
\item We select the intercept term to ensure that the probability of sporadic disease is approximately $20\%$.
\end{itemize}
\item We obtain $16$ risk SNVs by adding randomly sampled SNVs from the mid-region one at a time, until the prevalence is between $9.5\%$ and $10.5\%$ in the $1500$ individuals.
\end{itemize}
\item Sampling case-control data
\begin{itemize}
\item Sample $50$ cases and $50$ controls from all $1500$ individuals.
\item 2747 out of 4000 SNVs were polymorphic.
\item 10 out of 16 risk SNVs were polymorphic.
\end{itemize}
\end{enumerate}
\item Several popular methods
\begin{itemize}
\item Summary paragraph giving an overview of the different types of methods and the ideas motivating them.
\end{itemize}
\begin{enumerate}
\item Single-variant approach
\begin{itemize}
\item Fisher's exact test
\begin{itemize}
\item Each variant site in the case-control sample is tested for association with the disease outcome.
\item A $2 \times 3$ table is constructed to compare genotype frequencies at each variant site between cases and controls.
\begin{itemize}
\item Rows correspond to the disease status of individuals, and columns correspond to the three possible genotypes.
\end{itemize}
\item Recommended when the cell counts are small, as is expected for rare variants.
\end{itemize}
\item Single-variant tests are less powerful for rare variants \cite{Asimit_2010}.
\end{itemize}
\item Pooled-variant method
\begin{itemize}
\item VT \cite{Price_2010}: Variants with MAF below some threshold are more likely to be functional than variants with higher MAF.
\begin{itemize}
\item For each possible MAF threshold, a genotype score is computed based on a given collapsing scheme. The chosen MAF threshold maximizes the association signal, and permutation testing is used to adjust for the multiple thresholds.
\begin{itemize}
\item Based on collapsing variants into two categories: variants with MAF below the threshold, and variants with MAF above the threshold.
\item Suitable for effects in one direction.
\item \citeNP{Price_2010} found that the VT approach had high power to detect association between rare variants and the disease trait in their simulations.
\end{itemize}
\item We used the VTWOD function in the RVtests R package \cite{Xu_2012}.
\end{itemize}
\item C-alpha \cite{Neale_2011}: Tests the variance of the effect sizes for variants in a specific genomic window (no effect, increased risk or decreased risk).
\begin{itemize}
\item Sensitive to risk and protective variants in the same gene.
\item Powerful when the effects are in different directions.
\item R package: SKAT
\end{itemize}
\end{itemize}
\item Joint-modeling method
\begin{itemize}
\item CAVIARBF \cite{Chen_2015}: Fine-mapping method using marginal test statistics for the SNVs and their pairwise association. Approximates the Bayesian multivariate regression implemented in BIMBAM \cite{Servin_2007}.
%CAN YOU DESCRIBE HOW BIMBAM MODELS ALL POSSIBLE COMBINATIONS OF 1,2,3 etc. SNVS AND THEIR INTERACTION TERMS? THEN SAY THAT, TO KEEP THE COMPUTATIONAL LOAD DOWN, WE CONSIDERED ALL POSSIBLE COMBINATIONS OF SNVS UP TO PAIRS ONLY.
\begin{itemize}
\item To compute the probability that SNVs are causal, a set of models and their Bayes factors must be considered. Let $p$ be the total number of SNVs in a candidate region; then the number of possible causal models is $2^p$. Since it is infeasible to evaluate all models for large $p$, this approach imposes a limit on the number of causal variants in the model. This limit reduces the number of models to evaluate to $ \sum_{i=0}^{L} \dbinom{p}{i} $, where $L$ is the maximum number of causal SNVs in the model. Since there are 2747 SNVs in our data, to keep the computational load down, we considered $L=2$.
\end{itemize}
\item Elastic-net \cite{Zou_2005}: A hybrid regularization and variable selection method that linearly combines the L1 and L2 regularization penalties of the Lasso \cite{Tibshirani_2011} and Ridge \cite{Cessie_1992} methods in multivariate regression. We consider only main effects for SNVs in our elastic-net models.
\begin{itemize}
\item Particularly useful when the number of predictors exceeds the number of observations.
\item We select phenotype-associated SNVs via elastic-net regularization applied to 100 bootstrap samples.
\item We performed the analysis with the R package glmnet \cite{Friedman_2010,Simon_2011,Tibshirani_2011a}.
\end{itemize}
\end{itemize}
\item Tree-based method
\begin{itemize}
\item Reconstructed genealogical trees at each SNV (Blossoc, \citeNP{Mailund_2006}): A fast method to localize disease-causing variants.
\begin{itemize}
\item Approximates perfect phylogenies for each site, assuming an infinite-sites model of mutation, and scores them according to the non-random clustering of affected individuals.
\item \citeNP{Mailund_2006} found Blossoc to be a fast and accurate method to localize {\bf common} disease-causing variants, but how well does it work with rare variants?
\item Can use either phased or unphased genotype data. However, it is impractical to apply it to unphased data with more than a few SNPs due to the computational burden associated with phasing. We will therefore assume the SNV data are phased, as might be done in advance with a fast phasing algorithm such as fastPHASE \cite{Scheet_2006}, BEAGLE \cite{Browning_2011}, IMPUTE2 \cite{Howie_2009} or MACH \cite{Li_2010,Li_2009}.
\end{itemize}
\item True trees (MT-rank of the coalescent events, \citeNP{Burkett_2013}): Detect co-clustering of the disease trait and variants on genealogical trees.
\begin{itemize}
\item In practice, the true trees are unknown. However, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated. A previous simulation study \cite{Burkett_2013} established the optimality of these tests for detecting association. We therefore include two versions of the Mantel test as benchmarks for comparison.
\begin{itemize}
\item Version 1: Naive-Mantel test; phenotype is scored according to whether or not the haplotype comes from a case.
\item Version 2: Informed-Mantel test; phenotype is scored according to whether or not the haplotype comes from a case and carries a risk variant.
\end{itemize}
\item Upweight the short branches at the tips of the tree.
%(DESCRIBE BRIEFLY HOW WE ACHIEVE UPWEIGHTING OF THE SHORT BRANCHES AT THE TIPS).
We assign a branch length of one to all branches, even the relatively longer branches that are close to the time to the most recent common ancestor.
%[NOW CAN REMOVE: Expected number of time it takes for the final two of k lineages to coalesce is $ E(T_{2}) = 0.5 \times E(TMRCA) $. So, if we rank the coalescence events(i.e. intercoalescence times are 1 time unit), $ T_{2} $ becomes 1, as well as $T_{k}$ is one. So, this has the effect of upweighting the branch.]
\item Success in localization was declared if the strongest signal was in the risk region.
\end{itemize}
\end{itemize}
\item Paragraph to discuss how we scored localization and signal detection for each of these methods.
\begin{itemize}
\item Localization: scoring the distance of the peak signal from the risk region based on the average distance across the entire genomic region.
\item Detection: For a given simulated dataset and a given method, we used the maximum statistic across all SNVs as our global test statistic and determined its null distribution by permuting the case-control labels.
\end{itemize}
\end{enumerate}
\end{enumerate}
\item Results
\begin{enumerate}
\item Example dataset
\begin{enumerate}
\item Summary for population and sample
\begin{itemize}
\item Information on the SNVs in general (i.e., even the ones that do not appear in the study sample), such as:
\begin{itemize}
\item How many rSNVs?
\item What are their positions in the genomic region?
\item What are their MAFs in the population?
\item Do they appear in the study sample and, if so, what are the frequencies of their minor alleles in the cases and in the controls?
\end{itemize}
\item Information on the number of recombination breakpoints between the SNVs that appear in the study sample.
\end{itemize}
\item LD between rSNVs and other SNVs in the population
\item Single-variant statistics plot
\item Comparison of pooled-variant statistics
\item Joint-modeling statistics
\item Number of haplotypes carrying rSNVs in cases and in controls (clustered bar chart)
\end{enumerate}
\item 200 datasets.
\begin{itemize}
\item Localizing the signal: ECDF of the average distance from the peak
\item Association signal detection: ECDF of p-values
\end{itemize}
\end{enumerate}
\item Discussion
\begin{enumerate}
\item Review the purpose of the study and what we did.
\begin{itemize}
\item Through coalescent simulation, we have investigated the ability of several popular association methods to fine-map trait-influencing genetic variants.
\item We worked through a particular example dataset as a case study for insight into these popular methods and performed a simulation study to score which method localizes the risk region best.
\end{itemize}
\item Evaluate the localization results on the example dataset.
\begin{itemize}
\item The C-alpha test, which is based on the genotypes at all loci in the region, and the informed-Mantel test, which is based on known trees, are the only methods that successfully localize the association signal in the test data.
\item The peak signal from all the other methods (Fisher's test, VT, CAVIARBF, Elastic-net, Blossoc and naive-Mantel) is close to the disease risk region.
\item Even though the effects are in one direction, C-alpha shows a stronger localization signal in the risk region than VT. Our results for localization are consistent with the results of Burkett et al. for detection.
\end{itemize}
\item Evaluate the localization results from the simulation study.
\begin{itemize}
\item Not surprisingly, the informed-Mantel test, the benchmark, outperformed all the other methods.
However, Blossoc, CAVIARBF, C-alpha and Fisher's exact test performed comparably well in localizing the signal.
\item Interestingly, performance was poor for both the naive-Mantel and VT tests.
\end{itemize}
\item Evaluate the association signal detection from the simulation study.
\begin{itemize}
\item The informed-Mantel benchmark test performed extremely well, as expected.
\item C-alpha, CAVIARBF, and Blossoc performed better than VT, Elastic-net and Fisher's exact test.
\item VT did not perform much better than Fisher's exact test, similar to the result described in Burkett et al. for haploid data.
\end{itemize}
\item New ways of scoring phenotypes would be another area for research using simulated trees.
\item Limitations of the study
\begin{itemize}
\item Simple model of disease risk with additive effects and no covariates.
\item We had to limit the number of causal SNVs in the CAVIARBF model to ensure computational feasibility.
% \item Burkett et al. used haploid model. But we used diploid model.
\end{itemize}
\end{enumerate}
\end{enumerate}
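
As a rough worked illustration of two quantities in the outline above, using only the values stated there: with $p = 2747$ SNVs and $L = 2$, the CAVIARBF model space contains
$$ \sum_{i=0}^{2} \dbinom{2747}{i} = 1 + 2747 + \frac{2747 \times 2746}{2} = 1 + 2747 + 3771631 = 3774379 $$
models, compared with $2^{2747}$ models without the limit. Similarly, under the stated logistic model, and assuming purely for illustration an individual whose only risk allele is a single copy of the derived allele at one risk SNV (so that $\sum_{j=1}^{m} 2 \times G_j = 2$), the baseline odds of $0.2/0.8 = 0.25$ are multiplied by $e^{2} \approx 7.39$, giving
$$ P(D=1 \mid G) = \frac{0.25\, e^{2}}{1 + 0.25\, e^{2}} \approx \frac{1.85}{2.85} \approx 0.65 . $$
The single-carrier genotype is a hypothetical example and not a case taken from the simulated data.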