ROUGH DRAFT authorea.com/118815

# Draft

## Introduction

Put the introduction here.

## Methods

### Data simulation

We considered a $$2$$ million base pair(Mbp) genomic region with recombination rate $$1\times 10^{-8}$$per base pair per generation. For this region, we generated 200 datasets of SNVs for a case-control study of 50 cases and 50 controls. One of the datasets was used to illustrate the association methods and all 200 replicates were used to assess the methods’ abilities to localize the genomic risk region.

We used fastsimcoal2 (Excoffier 2013) to simulate ancestral trees for $$3000$$ haplotypes of $$4000$$ equispaced SNVs in a haploid population of constant effective size, $$N_{e}=6200$$ (Tenesa 2007). These 3000 haplotypes were randomly paired into 1500 individuals.

Once haplotypes were paired into 1500 individuals, disease status was assigned to the individuals based on randomly sampled risk SNVs from the middle part of the genomic region of $$950kbp-1050kbp$$. For the risk variants, the number of copies of the derived allele increases disease risk according to a logistic regression model:

$${logit}\{P(D=1|G)\}={logit}(0.2)+\sum_{j=1}^{m}2\times G_{j}\nonumber \\$$

where $$D$$ is disease status ($$D=1$$, case; $$D=0$$, control), $$G=(G_{1},G_{2},\ldots,G_{m})$$ is an individual’s multi-locus genotype at $$m$$ risk SNVs, with $$G_{j}$$ being the number of copies of the derived allele at the $$j^{th}$$ risk SNV, and we here select the intercept term to ensure that the probability of sporadic disease is approximately $$20\%$$. We randomly sampled SNVs from the middle region one at a time, until the disease prevalence was between $$9.5-10.5\%$$ in the $$1500$$ individuals. After assigning disease status to the 1500 individuals, we sampled 50 case and 50 control individuals from all affected and unaffected individuals. We then extracted the data for the variable SNVs in the case-control sample to examine the patterns of disease association in subsequent analyses.

### Association Mapping

In this section, we review several popular methods for association mapping under four categories: single-variant approach, pooled-variant methods, joint-modeling methods and tree-based methods.

#### Single-variant approach

As single-variant approach, we evaluated Fisher’s exact test. In Fisher’s exact test, each of the variant site in the case-control sample is tested for an association with the disease outcome using $$2\times 3$$ table to compare genotype frequencies at each variant site. This single-variant association was assessed with the p-value of Fisher’s exact test on the contingency table in each dataset. Each row in a table represents disease status of individuals, and a column represents the three possible genotypes.

#### Pooled-variant methods

The variable threshold (VT) approach of Price et al. (Price 2010) is based on the regression of phenotypes onto the counts of variants meeting the MAF threshold. Variants with MAF below some threshold are more likely to be functional than variants with higher MAF. For each possible MAF threshold, a genotype score is computed based on a given collapsing theme. The chosen MAF threshold maximizes the association signal and permutation testing is used to adjust for multiple thresholds. Price et al. 2010 found that the VT approach had high power to detect the association between rare variants and disease traits when effects are in one direction in their simulations. Unlike the VT test, the C-alpha test of (Neale 2011) is a variance components approach that assumes the effects of variants are random. The C-alpha procedure tests the variance of genetic effects under the assumption that variants observed in cases and controls are a mixture of deleterious, protective or neutral variants. We applied VTWOD function in R package RVtests (Xu 2012) for VT-test and SKAT function in R package SKAT for C-alpha test across the simulated region by using sliding windows of 20 SNVs overlapping by 5 SNVs.

#### Joint-modeling methods

CAVIARBF (Chen 2015) is a fine-mapping method that uses marginal test statistics for the SNVs and their pairwise association to approximate the Bayesian multivariate regression of phenotypes onto variants that is implemented in BIMBAM (Servin 2007). However, CAVIARBF is much faster than BIMBAM because it computes Bayes factors using only the SNVs in each causal model. These Bayes factors can be used to calculate the posterior probability of SNVs being causal in the region (the posterior inclusion probability). To compute the posterior inclusion probability (PIP) of a SNV, a set of regression models and their Bayes factors have to be considered. Let $$p$$ be the total number of SNVs in a candidate region, then the number of all possible causal models is $$2^{p}$$. To reduce the number of causal models to evaluate and save computational time and effort, CAVIARBF imposes a limit, $$L$$, on the number of causal variants. This limitation ensures that linear interaction terms involving more than $$L$$ SNVs do not occur and reduces the number of models to evaluate from $$2^{p}$$ to $$\sum_{i=0}^{L}\dbinom{p}{i}$$. Since there were $$p=2747$$ SNVs in our example dataset, to keep the computational load down, we considered $$L=2$$ throughout this investigation.

Elastic-net (Zou 2005) is a hybrid regularization and variable selection method that linearly combines the $$L1$$ and $$L2$$ regularization penalties of Lasso (Tibshirani 2011), and Ridge (Cessie 1992) methods in multivariate regression. This combination of Lasso and Ridge penalties provides a more precise prediction than using multiple regression, when SNVs are in high linkage disequilibrium (citation not found: Cho_2009) . In addition, the elastic-net can accommodate situations in which the number of predictors exceeds the number of observations. We used the elastic-net to select risk SNVs by considering only the main effects. We used the SNV’s variable inclusion probability (VIP), a frequentist analog of the Bayesian posterior inclusion probability, as a measure of its importance for predicting disease risk. To obtain the VIP for a SNV, we re-fit the elastic-net model using $$100$$ bootstrap samples and calculated the proportion of samples in which the SNV was included in fitted model.

#### Tree-based methods

We considered two methods to assess clustering of disease status in genealogical trees: Blossoc (BLOck aSSOCiation, (Mailund 2006)), which uses reconstructed trees, and a Mantel test which uses the true trees. In practice, the true trees are unknown. However, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated (citation not found: Burkett_2013). We therefore include two versions of the Mantel test as a benchmark for comparison. In the first version, the phenotype is scored according to whether or not the haplotype comes from a case. We refer to the first version as the naive Mantel test because all case haplotypes are treated the same, even those not carrying any risk variants. In the second version, the phenotype is scored according to whether or not the haplotype comes from a case and carries a risk variant. We refer to the second version as the informed Mantel test because it takes into account whether or not a case haplotype carries a risk variant.

Blossoc aims to localize the risk variants by reconstructing genealogical trees at each SNV. This method approximates perfect phylogenies for each SNV, assuming an infinite-sites model of mutation, and scores them according to the non-random clustering of affected individuals. The underlying idea is that genomic regions containing SNVs with high clustering scores are likely to harbour risk variants. Blossoc can be used for both phased and unphased genotype data. However, the method is impractical to apply to unphased data with more than few SNVs due to the computational burden associated with phasing. We therefore assumed the SNV data are phased, as might be done in advance with a fast-phasing algorithm such as fastPHASE (Scheet 2006), BEAGLE (Browning 2011), IMPUTE2 (Howie 2009) or MACH (Li 2010, Li 2009), and evaluated Blossoc with the phased haplotypes, using the probability-scores criterion which is the recommended scoring scheme for small datasets (Mailund 2006).

The Mantel tests correlate the pairwise distance in the known ancestry with those in the phenotypes. Following (citation not found: Burkett_2013), we used pairwise distances calculated from the rank of the coalescent event rather than the actual times on the tree. The test statistic upweights the short branches at the tip of the tree by assigning a branch-length of one to all branches, even the relatively longer branches that are close to the time to the most common ancestor. Pairwise distances between haplotypes on this re-scaled tree are then correlated to pairwise phenotypic distances. We determined the distance measure matrix, $$d_{ij}=1-s_{ij}$$, where $$s_{ij}=(y_{i}-\mu)(y_{j}-\mu)$$ is the similarity score between haplotype $$i$$ and $$j$$, $$y_{i}$$ is the binary phenotype (coded as $$0$$ or $$1$$) and $$\mu$$ is the disease prevalence in the 1500 simulated individuals. We then used the Mantel statistic to compare the phenotype-based distance matrix, $$d$$, with the re-scaled tree-distance matrix. A previous simulation study of Burkett et al. established the optimality of tree based approach for detecting association. We therefore included two versions of Mantel test (naive-Mantel test and informed-Mantel test) as a benchmark for comparison. Note that we defined a ’phenotype’ for each haplotype within an individual. Therefore, an individual has two phenotypes rather than one. In naive-Mantel test, phenotype was scored according to whether or not haplotype comes from a case. In informed-Mantel test, phenotype was scored according to whether or not haplotype comes from a case and carries a risk variant.