ROUGH DRAFT authorea.com/80945

# Outline

1. Inroduction

1. Area of study

• Association studies based on next-generation sequencing data

• Sequence data allows us to study both common and rare variants

• About trees underlying the sequence data (where mutation occurs on tree). CAN YOU FLESH THIS OUT WITH SOME SUBPOINTS?

• Trees at each locus in the region, represent the genealogy of haplotypes.

• Loci between recombination events have the same ancestral tree.

• Tree based approach is powerful to detect the association with disease. (Bardel 2005)

2. Brief literature review:

• The availability of whole genome and exome sequencing are enabling rapid advances in genomic research.

• Rare genetic variants can play major roles in influencing complex traits. (Pritchard 2001, Schork 2009)

• The rare susceptibility variants identified through sequencing have potential to explain some of the ’missing heritability’ of complex traits. (Eichler 2010).

• However, standard methods to test for association with single genetic variants are underpowered for rare variants unless sample sizes are very large. (Li 2008)

• Overview of 3 types of analysis methods (Besides single-variant approach)

• Pooled-variant methods combine the association information across multiple variant sites within a gene. Pooling information in this way can enrich the association signal (Lee 2014)(CITE SOME REFS FOR EXAMPLES)

• Joint-modeling methods identify the joint effect of multiple genetic variants simultaneously (Cho 2010). This may be a more powerful approach than pooling marginal associations across variants when trait-influencing variants are in low linkage disequilibrium.

• Individuals carrying the same disease-predisposing variant are likely to inherit it from the same ancestor. Therefore, cases will tend to cluster together in the underlying genealogy. A method to detect clustering of the cases on the tree represents an alternative grouping method based on relatedness (CITE SOME REFS FOR EXAMPLES).

3. Purpose of the study

• To investigate the ability of several association methods to fine-map trait-influencing, causal variants within a 2Mb candidate genomic region.

• We use variant data simulated from coalescent trees. Our work extends that of Burkett et al., which investigated the ability to detect association signal in the candidate region without regard to localization.

• Work through a particular example as a case study for insight into several popular methods for association mapping.

• Simulate 200 datasets and score which method localizes best, overall.

4. Make the point here in the intro that we’ve included true trees in the comparison, even though we won’t know them in practice because in principle this should be the best result. THIS NEEDS WORK.

2. Methods

1. Data simulation (CAN YOU PLEASE FLESH OUT ALL THE SUBSECTIONS IN THIS SECTION WITH MORE SUBPOINTS)

1. Simulating the population

• fastsimcoal2 (Excoffier 2013)

• Simulate 3000 haplotypes of 4000 equispaced SNVs in a 2-Mbp region.

• Recombination rate $$= 1 \times 10^{-8}$$ per bp per generation.

• Population effective size, $$N_{e} = 6200$$ (Tenesa 2007)

• Randomly pair the 3000 haplotypes into 1500 diploid individuals.

• Logistic regression model of disease status.

• Assign disease status to the 1500 individuals based on randomly sampled risk SNVs from the mid region (950kbp - 1050kbp) and a diploid model of disease risk.

• For risk SNVs, the number of copies of the derived allele increases disease risk according to a logistic regression model, ${logit}\{P(D=1|G)\} = {logit}(0.2)+ \sum_{j=1}^{m} 2 \times G_j,\;\;\mbox{where,}$

• $$D$$ is disease status ($$D = 1$$, case; $$D=0$$, control).

• $$G=(G_1, G_2, \ldots , G_{m})$$ is an individual’s multi-locus genotype at $$m$$ risk SNVs, with $$G_j$$ being the number of copies of the derived allele at the $$j^{th}$$ risk SNV.

• We obtain $$13$$ risk SNVs by adding randomly sampled SNVs from the mid-region one-at-a-time, until the prevalence is between $$9.5-10.5\%$$ in the $$1500$$ individuals.

2. Sampling case-control data

• Sample $$50$$ cases and $$50$$ controls from all $$1500$$ individuals.

• 2630 out of 4000 SNVs were polymorphic.

• 8 out of 13 risk SNVs were polymorphic.

2. Several popular methods

• Summary paragraph giving an overview of the different types of methods and the ideas motivating them.

1. Single-variant approach

• Fisher’s exact test

• Individual variant sites are tested for an association with the disease outcome

• $$2\times 3$$ table constructed to compare genotype frequencies at each variant site in case controls.

• Rows are disease status, and columns correspond to the three possible genotypes.

• Recommended when the cell counts are small.

• Single-variant tests are less powerful for rare variants (Asimit 2010)

2. Pooled-variant method

• VT (Price 2010): Variants with MAF below some threshold are more likely to be functional than the variants with higher MAF.

• For each possible MAF threshold, a genotype score is computed based on given collapsing theme. The chosen MAF threshold maximizes the association signal and permutation testing is used to adjust for the multiple thresholds.

• Based on collapsing variants into a two categories: variants with MAF below the threshold, and above the threshold.

• Suitable for effects in one direction.

• Price et al. 2010 found the VT approach had high power to detect the association between rare variants and disease trait in their simulations.

• We used VTWOD function in RVtests R package (Xu 2012).

• C-alpha (Neale 2011): Test the variance of the effect size for variants in a specific genomic window (No effect, increase or decrease risk).

• Sensitive to risk and protective variants in the same gene.

• Powerful when the effects are in different directions.

• R package: SKAT

3. Joint-modeling method

• CAVIARBF (Chen 2015) Fine mapping method using marginal test statistics for the SNVs and their pairwise association. Approximates the Bayesian multivariate regression implemented in BIMBAM (Servin 2007). CAN YOU DESCRIBE HOW BIMBAM MODELS ALL POSSIBLE COMBINATIONS OF 1,2,3 etc. SNVS AND THEIR INTERACTION TERMS? THEN SAY THAT, TO KEEP THE COMPUTATIONAL LOAD DOWN, WE CONSIDERED ALL POSSIBLE COMBINATIONS OF SNVS UP TO PAIRS ONLY.

• To compute the probability of SNVs being causal, set of models and their Bayes factors have to be considered. Let $$p$$ be the total number of SNVs in a candidate region, then the all possible number of causal models is $$2^p$$. Since it is difficult to compute all models for large $$p$$, this approach has a limitation on the number of causal variants in the model. So, this limitation reduces the number of models to evaluate in the model space, to $$\sum_{i=0}^{L} \dbinom{p}{i}$$, where $$L$$ is the number of causal SNVs in the model. Since there are 2630 SNVs in our data, to keep the computational load down, we considered $$L=2$$.

• Elastic-net (Zou 2005): A hybrid regularization and variable selection method that linearly combines the L1 and L2 regularization penalties of the Lasso (Tibshirani 2011) and Ridge (Cessie 1992) methods in multivariate regression. WE CONSIDER ONLY MAIN EFFECTS FOR SNVs IN OUR ELASTIC NET MODELS.

• Particularly useful when number of predictors exceeds the number of observations.

• We select phenotype associated SNVs via elastic-net regularization from the 100 bootstrap samples.

• We performed analysis with R package glmnet (Friedman 2010, Simon 2011, Tibshirani 2011a)

4. Tree-Based method

• Reconstructed genealogical trees at each SNV (Blossoc, Mailund et al. 2006): A fast method to localize the disease-causing variants.

• Approximates perfect phylogenies for each site, assuming infinite site model of mutation and scores according to the non-random clustering of affected individuals.

• Mailund et al. 2006 have found Blossoc to be a fast and accurate method to localize common disease-causing variants but how well does it work with rare variants?

• Can use either phased or unphased genotype data. However, it is impractical to apply it to unphased data with more than a few SNPs due to the computational burden associated with phasing. We will thereform assume the SNV data are phased, as might be done in advance with a fast-phasing algorithm such as fastPHASE (Scheet 2006), BEAGLE (Browning 2011), IMPUTE2 (Howie 2009) or MACH (Li 2010, Li 2009).

• True trees (MT-rank of the coalescent events, Burkett et al. 2013): Detect co-clustering of the disease trait and variants on genealogical trees.

• In practice, the true trees are unknown. However, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated. A previous simulation study (Burkett 2013) established the optimality of these tests for detecting association. We therefore include a clustering test based on true trees as a benchmark for comparison.

• Upweight the short branches at the tip of the tree. (DESCRIBE BRIEFLY HOW WE ACHIEVE UPWEIGHTING OF THE SHORT BRANCHES AT THE TIPS). We assign a branch-length of one to all branches, even the relatively longer branches that are close to the time to the most recent common ancestor. [NOW CAN REMOVE: Expected number of time it takes for the final two of k lineages to coalesce is $$E(T_{2}) = 0.5 \times E(TMRCA)$$. So, if we rank the coalescence events(i.e. intercoalescence times are 1 time unit), $$T_{2}$$ becomes 1, as well as $$T_{k}$$ is one. So, this has the effect of upweighting the branch.]

• Success in localization was declared if the strongest signal was in the risk region.

5. Paragraph to discuss how we will score localization for each of these methods. QUESTION: Besides recording whether the peak signal is in the risk region, can we also record how far the peak signal is from the risk region for each of the simulation replicates?

3. Results

1. Example dataset

1. Summary for population and sample

• Information on the SNVs in general (i.e. even the ones that don’t appear in the study sample) such as;

• How many rSNVs?

• What are their positions in the genomic region?

• What are their MAFs in the population?

• Do they appear in the study sample and if yes, what is the frequency of their minor allele in the sample of cases and what is the frequency of their minor allele in the controls?

• Information on the number of recombination breakpoints between the SNVs that appear in the study sample

2. LD between rSNVs and others for population

3. Single-variant statistics plot

4. Compare pooled variant statistics

5. Joint-modelling statistics

6. Between-class comparison

2. 100 datasets. (Let’s aim for 200 if possible, but start with 100 for now.)

4. Discussion

1. Review the purpose of the study

2. Evaluate the localization results on the test data

• Blossoc and known trees are the only methods that successfully localize the association signal in the test data.

• C-alpha shows higher signal than VT which is consistent with the findings of Burkett et al.

3. Evaluate the localization results from the simulation study

• Put something here

4. MAIN CONCLUSIONS FROM THE ABOVE EVALUATIONS? THAT TREE_BASED STATS ARE THE WAY TO GO?

5. Limitations of the study

• Simple model of disease risk with additive effects and no covariates.