Jinko Graham edited untitled.tex over 7 years ago

$$
where $D$ is disease status ($D=1$, case; $D=0$, control), and $G=(G_1, G_2, \ldots, G_{m})$ is an individual's multi-locus genotype at $m$ risk SNVs, with $G_j$ being the number of copies of the derived allele at the $j^{\mathrm{th}}$ risk SNV. We randomly sampled SNVs from the middle region one at a time until the disease prevalence among the $1500$ individuals was between $9.5\%$ and $10.5\%$. After assigning disease status to the $1500$ individuals, we sampled $50$ cases from the affected individuals and $50$ controls from the unaffected individuals. We then extracted the data for the variable SNVs in the case-control sample to examine the patterns of disease association in subsequent analyses.

\subsection{Association Mapping}
\bigskip
In this section we give an overview of several association mapping methods and describe how we used them in our simulation study.

\subsubsection{Single-variant approach}
\bigskip
\begin{flushleft}
We evaluated Fisher's exact test, a classical tool for studying the association between genotype and disease traits using contingency tables. For each SNV, we tested the null hypothesis of no association between the rows (disease status) and columns (genotypes) of a $2\times 3$ contingency table. Each table contains the counts of the two homozygous genotypes and the heterozygous genotype in cases and in controls.
%We then computed the P-value from each association.
\end{flushleft}

\subsubsection{Pooled-variant methods}
\bigskip
The variable threshold (VT) approach of Price et al. \cite{Price_2010} is based on the regression of phenotypes onto the counts of variants meeting the MAF threshold. Variants with MAF below the threshold are assumed to be more likely to be functional than variants with higher MAF. For each possible MAF threshold, a genotype score is computed based on a given collapsing scheme.
The chosen MAF threshold is the one that maximizes the association signal, and permutation testing is used to adjust for the multiple thresholds considered. In their simulations, \citeNP{Price_2010} found that the VT approach had high power to detect association between rare variants and disease traits when the variant effects are all in the same direction. Unlike the VT test, the C-alpha test of \citeNP{Neale_year} is a variance-components approach that assumes the effects of variants are random. The C-alpha procedure tests the variance of the genetic effects under the assumption that the variants observed in cases and controls are a mixture of deleterious, protective and neutral variants. We applied both the VT test and the C-alpha test across the simulated region using sliding windows of 20 SNVs overlapping by 5 SNVs.

\subsubsection{Joint-modeling methods}
\bigskip
CAVIARBF \cite{Chen_2015} is a fine-mapping method that uses marginal test statistics for the SNVs and their pairwise associations to approximate the Bayesian multivariate regression of phenotypes onto variants that is implemented in BIMBAM \cite{Servin_2005}. However, CAVIARBF is much faster than BIMBAM because it computes Bayes factors using only the SNVs in each causal model. These Bayes factors can be used to calculate the posterior probability that a SNV in the region is causal (the posterior inclusion probability). Computing the posterior inclusion probability of a SNV requires considering a set of regression models and their Bayes factors. Let $p$ be the total number of SNVs in a candidate region; then the number of possible causal models is $2^p$. To reduce the number of causal models to evaluate, and hence the computational time and effort, CAVIARBF imposes a limit, $L$, on the number of causal variants. This limit excludes models with more than $L$ causal SNVs and reduces the number of models to evaluate from $2^p$ to $\sum_{i=0}^{L} \dbinom{p}{i}$.
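As a worked illustration of this reduction (our own arithmetic, not a figure reported by the authors of CAVIARBF), a limit of $L=2$ expands the restricted count to
$$
\sum_{i=0}^{2} \binom{p}{i} = 1 + p + \frac{p(p-1)}{2},
$$
which grows quadratically rather than exponentially in $p$. For the $p=2630$ SNVs in the dataset considered here, this gives $1 + 2630 + 3{,}457{,}135 = 3{,}459{,}766$ models, compared with $2^{2630}$ models without the limit.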
Because our example dataset contained $p=2630$ SNVs, we set $L=2$ throughout this investigation to keep the computational load manageable.

Elastic-net \cite{Zou_2005} is a hybrid regularization and variable selection method that linearly combines the $L_1$ and $L_2$ regularization penalties of the Lasso \cite{Tibshirani_2011} and Ridge \cite{Cessie_1992} methods in multivariate regression. This combination of Lasso and Ridge penalties provides more precise prediction than multiple regression when SNVs are in high linkage disequilibrium (REFERENCE REQUIRED). In addition, the elastic-net can accommodate situations in which the number of predictors exceeds the number of observations. We used the elastic-net to select risk SNVs, considering only main effects. As a measure of the importance of a SNV for predicting disease risk, we used the SNV inclusion probability, a frequentist analog of the Bayesian posterior inclusion probability. To obtain the SNV inclusion probability, we re-fit the elastic-net model to $100$ bootstrap samples and calculated the proportion of samples in which the SNV was included in the fitted model.

\subsubsection{Tree-based methods}
\bigskip
We considered two tree-based methods: Blossoc (BLOck aSSOCiation, \cite{Mailund_2006}) and the Mantel test based on the rank of coalescent events \cite{Burkett_2013}.