ROUGH DRAFT authorea.com/80945

# Introduction

## Brief literature review

• Most genetic association studies focus on common variants.

• Rare genetic variants, however, can play major roles in influencing complex traits (Pritchard 2001; Schork 2009).

• Rare susceptibility variants identified through sequencing have the potential to explain some of the 'missing heritability' of complex traits (Eichler 2010).

• Standard methods to test for association with single genetic variants are nevertheless underpowered for rare variants unless sample sizes are very large (Lee 2014).

• The lack of power of single-variant approaches holds in fine-mapping as well as genome-wide association studies.

• In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci.

• A number of methods have been developed to evaluate disease association with single variants and with multiple variants in a genomic region.

• Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods.

• Overview of 3 types of analysis methods (Besides single-variant approach)

• Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic, either by combining the information for multiple variants into a single genetic score or by evaluating the distribution of the pooled score statistics of individual variants. (Lee 2014)

• Joint-modelling methods identify the joint effect of multiple genetic variants simultaneously. These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants (Cho 2010).

• Tree-based methods.

• A local genealogical tree represents the ancestry of the sample of haplotypes at each locus in the genomic region being fine-mapped.

• Haplotypes carrying the same disease risk alleles are expected to be related and cluster on the genealogical tree at a disease risk locus.

• Tree-based methods assess whether trait values co-cluster with the ancestral tree for the haplotypes (e.g., Bardel et al. 2005).

• Mailund et al. (2006) developed a method to reconstruct and score genealogies according to the case-control clusters.

• Review Burkett et al. study briefly(!), what it found.

• In practice, the true trees are unknown. However, cluster statistics based on true trees represent a best case for detecting association because tree uncertainty is eliminated.

• Burkett et al. use known trees to assess the effectiveness of such a tree-based approach for detection of rare, disease-risk variants in a candidate genomic region under various models of disease risk in a haploid population.

• They found that Mantel statistics computed on the known trees outperform popular methods for detecting rare variants associated with disease.

• Following Burkett et al., we use clustering tests based on true trees as benchmarks against which to compare the popular association methods.

• However, unlike Burkett et al., who focus on detection of disease risk variants, we here focus on localization of association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model.

## Purpose of the study

• To compare the performance of selected rare-variant association methods for fine-mapping a disease locus. In our investigation, we focus on localizing the association signal to the interval $$950$$–$$1050$$ kbp within a 2-Mbp candidate genomic region.

• We use variant data simulated from coalescent trees. Our work on localization of association signal extends that of Burkett et al., which investigated the ability to detect association signal in the candidate region, without regard to localization.

• To illustrate ideas, we start by working through a particular example dataset as a case study for insight.

• Next, we perform a simulation study involving 200 sequencing datasets and score which association method localizes best, overall.

# Methods

## Data simulation

1. Simulating the population

• fastsimcoal2 (Excoffier 2013)

• Simulate 3000 haplotypes of 4000 equispaced SNVs in a 2-Mbp region.

• Recombination rate $$= 1 \times 10^{-8}$$ per bp per generation.

• Effective population size, $$N_{e} = 6200$$ (Tenesa 2007).

• Randomly pair the 3000 haplotypes into 1500 diploid individuals.

• Logistic regression model of disease status.

• Once haplotypes were paired into 1500 diploid individuals, disease status was assigned to the individuals based on risk SNVs randomly sampled from the middle of the genomic region ($$950$$–$$1050$$ kbp).

• For risk SNVs, the number of copies of the derived allele increases disease risk according to a logistic regression model, $$\operatorname{logit}\{P(D=1 \mid G)\} = \operatorname{logit}(0.2) + \sum_{j=1}^{m} 2 \times G_j,$$ where

• $$D$$ is disease status ($$D = 1$$, case; $$D=0$$, control).

• $$G=(G_1, G_2, \ldots , G_{m})$$ is an individual’s multi-locus genotype at $$m$$ risk SNVs, with $$G_j$$ being the number of copies of the derived allele at the $$j^{th}$$ risk SNV.

• We select the intercept term to ensure that the probability of sporadic disease (i.e., $$P(D=1 \mid G=\mathbf{0})$$) is approximately $$20\%$$.

• We randomly sampled SNVs from the middle region one at a time, until the disease prevalence was between $$9.5\%$$ and $$10.5\%$$ in the $$1500$$ individuals.

• After assigning disease status to the 1500 individuals, we sampled 50 cases (i.e., diseased individuals) from the affected individuals and 50 controls (i.e., non-diseased individuals) from the unaffected individuals.

• We then extracted the data for the variable SNVs in the case-control sample to examine the patterns of disease association in subsequent analyses.
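
The pairing, disease-status and case-control sampling steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used for the study: all function names are hypothetical, haplotypes are assumed to be lists of 0/1 derived-allele indicators, the fastsimcoal2 step and the prevalence-based stopping rule for choosing risk SNVs are omitted, and the defaults `baseline=0.2` and `effect=2.0` simply mirror the intercept and per-allele coefficient of the logistic model.

```python
import math
import random


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def disease_probability(risk_genotype, baseline=0.2, effect=2.0):
    """P(D=1 | G) with linear predictor logit(baseline) + effect * sum_j G_j."""
    eta = logit(baseline) + effect * sum(risk_genotype)
    return 1.0 / (1.0 + math.exp(-eta))


def pair_haplotypes(haplotypes, rng):
    """Randomly pair haplotypes into diploid genotypes (allele counts 0, 1 or 2)."""
    order = list(range(len(haplotypes)))
    rng.shuffle(order)
    genotypes = []
    for a, b in zip(order[0::2], order[1::2]):
        genotypes.append([x + y for x, y in zip(haplotypes[a], haplotypes[b])])
    return genotypes


def assign_status(genotypes, risk_snvs, rng, baseline=0.2, effect=2.0):
    """Draw disease status (1 = case, 0 = control) for each individual."""
    status = []
    for g in genotypes:
        # Only the genotypes at the chosen risk SNVs enter the model.
        p = disease_probability([g[j] for j in risk_snvs], baseline, effect)
        status.append(1 if rng.random() < p else 0)
    return status


def sample_case_control(status, n_cases, n_controls, rng):
    """Sample case indices from affected and control indices from unaffected."""
    affected = [i for i, d in enumerate(status) if d == 1]
    unaffected = [i for i, d in enumerate(status) if d == 0]
    return rng.sample(affected, n_cases), rng.sample(unaffected, n_controls)


def variable_snvs(genotypes, individuals):
    """Indices of SNVs that are polymorphic among the given individuals."""
    n_snvs = len(genotypes[0])
    return [j for j in range(n_snvs)
            if len({genotypes[i][j] for i in individuals}) > 1]
```

Under these assumptions, an individual carrying no derived alleles at the risk SNVs has disease probability 0.2, and a single copy of one risk allele raises it to about 0.65, since the odds are multiplied by $$e^{2}$$. Passing a seeded `random.Random` instance as `rng` keeps the pairing and sampling reproducible.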