
Charith Bhagya Karunarathna edited section_Methods_subsection_Data_simulation__.tex over 7 years ago

Commit id: 8b5cc8a6e027928a48de73779eb474f68b7b4672


\end{itemize}
\end{enumerate}
\subsection{Several popular methods}
\begin{itemize}
  \item Summary paragraph giving an overview of the different types of methods and the ideas motivating them.
\end{itemize}
\begin{enumerate}
  \item Single-variant approach
  \begin{itemize}
    \item Fisher's exact test
    \begin{itemize}
      \item Each variant site in the case-control sample is tested for association with the disease outcome.
      \item A $2 \times 3$ table is constructed to compare genotype frequencies at each variant site between cases and controls.
      \begin{itemize}
        \item Rows correspond to the disease status of individuals, and columns correspond to the three possible genotypes.
      \end{itemize}
      \item Recommended when the cell counts are small, as is expected for rare variants.
    \end{itemize}
    \item Single-variant tests are less powerful for rare variants \cite{Asimit_2010}.
  \end{itemize}
  \item Pooled-variant method
  \begin{itemize}
    \item VT (variable threshold) \cite{Price_2010}: Assumes that variants with minor allele frequency (MAF) below some threshold are more likely to be functional than variants with higher MAF.
    \begin{itemize}
      \item For each possible MAF threshold, a genotype score is computed based on a given collapsing scheme. The chosen MAF threshold maximizes the association signal, and permutation testing is used to adjust for the multiple thresholds.
      \begin{itemize}
        \item Based on collapsing variants into two categories: variants with MAF below the threshold, and variants with MAF above the threshold.
        \item Suitable for effects in one direction.
        \item \citeNP{Price_2010} found that the VT approach had high power to detect association between rare variants and the disease trait in their simulations.
      \end{itemize}
      \item We used the VTWOD function in the RVtests R package \cite{Xu_2012}.
    \end{itemize}
    \item C-alpha \cite{Neale_2011}: Tests the variance of the effect sizes of variants in a specific genomic window, where each variant may have no effect, increase risk, or decrease risk.
    \begin{itemize}
      \item Sensitive to risk and protective variants in the same gene.
      \item Powerful when the effects are in different directions.
      \item R package: SKAT
    \end{itemize}
  \end{itemize}
  \item Joint-modeling method
  \begin{itemize}
    \item CAVIARBF \cite{Chen_2015}: A fine-mapping method that uses marginal test statistics for the SNVs and their pairwise associations. Approximates the Bayesian multivariate regression implemented in BIMBAM \cite{Servin_2007}.
    %CAN YOU DESCRIBE HOW BIMBAM MODELS ALL POSSIBLE COMBINATIONS OF 1,2,3 etc. SNVS AND THEIR INTERACTION TERMS? THEN SAY THAT, TO KEEP THE COMPUTATIONAL LOAD DOWN, WE CONSIDERED ALL POSSIBLE COMBINATIONS OF SNVS UP TO PAIRS ONLY.
    \begin{itemize}
      \item To compute the probability that each SNV is causal, a set of models and their Bayes factors must be considered. Let $p$ be the total number of SNVs in a candidate region; the number of possible causal models is then $2^p$. Because evaluating all models is infeasible for large $p$, the approach imposes a limit on the number of causal variants allowed in a model. This limit reduces the number of models to evaluate to $\sum_{i=0}^{L} \dbinom{p}{i}$, where $L$ is the maximum number of causal SNVs in a model. Since there are 2747 SNVs in our data, we set $L=2$ to keep the computational load down.
    \end{itemize}
    \item Elastic-net \cite{Zou_2005}: A hybrid regularization and variable selection method that linearly combines the L1 and L2 regularization penalties of the Lasso \cite{Tibshirani_2011} and Ridge \cite{Cessie_1992} methods in multivariate regression. We consider only main effects for SNVs in our elastic-net models.
    \begin{itemize}
      \item Particularly useful when the number of predictors exceeds the number of observations.
      \item We select phenotype-associated SNVs via elastic-net regularization from 100 bootstrap samples.
      \item We performed the analysis with the R package glmnet \cite{Friedman_2010,Simon_2011,Tibshirani_2011a}.
    \end{itemize}
  \end{itemize}
  \item Tree-based method
  \begin{itemize}
    \item Reconstructed genealogical trees at each SNV (Blossoc, \citeNP{Mailund_2006}): A fast method to localize the disease-causing variants.
    \begin{itemize}
      \item Approximates perfect phylogenies for each site, assuming an infinite-sites model of mutation, and scores each site according to the non-random clustering of affected individuals.
      \item \citeNP{Mailund_2006} found Blossoc to be a fast and accurate method for localizing {\bf common} disease-causing variants, but how well does it work with rare variants?
      \item Can use either phased or unphased genotype data. However, it is impractical to apply it to unphased data with more than a few SNPs because of the computational burden associated with phasing. We will therefore assume the SNV data are phased, as might be done in advance with a fast phasing algorithm such as fastPHASE \cite{Scheet_2006}, BEAGLE \cite{Browning_2011}, IMPUTE2 \cite{Howie_2009} or MACH \cite{Li_2010,Li_2009}.
    \end{itemize}
    \item True trees (MT-rank of the coalescent events, \citeNP{Burkett_2013}): Detect co-clustering of the disease trait and variants on genealogical trees.
    \begin{itemize}
      \item In practice, the true trees are unknown. However, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated. A previous simulation study \cite{Burkett_2013} established the optimality of these tests for detecting association. We therefore include two versions of the Mantel test as a benchmark for comparison.
      \begin{itemize}
        \item Version 1: Naive-Mantel test, in which the phenotype is scored according to whether or not the haplotype comes from a case.
        \item Version 2: Informed-Mantel test, in which the phenotype is scored according to whether or not the haplotype comes from a case and carries a risk variant.
      \end{itemize}
      \item Upweight the short branches at the tips of the tree.
%(DESCRIBE BRIEFLY HOW WE ACHIEVE UPWEIGHTING OF THE SHORT BRANCHES AT THE TIPS).
      We assign a branch length of one to all branches, even the relatively longer branches that are close to the time to the most recent common ancestor.
      %[NOW CAN REMOVE: Expected number of time it takes for the final two of k lineages to coalesce is $ E(T_{2}) = 0.5 \times E(TMRCA) $. So, if we rank the coalescence events (i.e. intercoalescence times are 1 time unit), $ T_{2} $ becomes 1, as well as $T_{k}$ is one. So, this has the effect of upweighting the branch.]
      \item Success in localization was declared if the strongest signal was in the risk region.
    \end{itemize}
  \end{itemize}
  \item Paragraph to discuss how we scored localization and signal detection for each of these methods.
  \begin{itemize}
    \item Localization: scored by the distance of the peak signal from the risk region, based on the average distance across the entire genomic region.
    \item Detection: for a given simulated dataset and a given method, we used the maximum statistic across all SNVs as our global test statistic and determined its null distribution by permuting the case-control labels.
  \end{itemize}
\end{enumerate}
\end{enumerate}
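
As a rough worked check of the CAVIARBF model-space restriction described above, the following sketch uses only the values stated there, $p = 2747$ SNVs and at most $L = 2$ causal SNVs per model; the comparison with the unrestricted model space is an order-of-magnitude approximation.
\[
\sum_{i=0}^{2} \dbinom{2747}{i} = 1 + 2747 + \frac{2747 \times 2746}{2} = 1 + 2747 + 3771631 = 3774379 .
\]
The restricted model space therefore contains about $3.8 \times 10^{6}$ models, compared with $2^{2747} \approx 10^{827}$ models if no limit were placed on the number of causal SNVs.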