Using gene genealogies to localize rare variants associated with complex traits in diploid populations

Introduction

Most genetic association studies focus on common variants, but rare genetic variants can play major roles in influencing complex traits (Pritchard 2001, Schork 2009). The rare susceptibility variants identified through sequencing have the potential to explain some of the ‘missing heritability’ of complex traits (Eichler 2010). However, for rare variants, standard methods to test for association with single genetic variants are underpowered unless sample sizes are very large (Lee 2014). This lack of power of single-variant approaches holds in fine-mapping as well as in genome-wide association studies.

In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci. Our work extends an earlier comparison of methods for detecting disease association in cases and controls (Burkett 2014) to a comparison of methods for localizing the association signal. In the previous investigation, cases and controls were sampled from a haploid (one-parent) population. In the current investigation, cases and controls are sampled from a diploid (two-parent) population to mimic studies in human populations.

A number of methods have been developed to evaluate disease association with either a single variant or multiple variants in a genomic region. Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods. Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic by combining the information for multiple variants into a single genetic score (Lee 2014). Joint-modelling methods model the joint effect of multiple genetic variants on the trait simultaneously. These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants (Cho 2010). Tree-based methods assess whether trait values co-cluster with the local genealogical tree for the haplotypes (e.g., Bardel et al. 2005). A local genealogical tree represents the ancestry of the sample of haplotypes at a given locus. Haplotypes carrying the same disease-risk alleles are expected to be related and to cluster on the genealogical tree at a disease-risk locus. Mailund et al. (2006) developed a method to reconstruct and score local genealogies according to the clustering of cases and controls.
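To make the pooled-variant idea concrete, the following is a minimal sketch of a burden-style score statistic for a binary trait, assuming an intercept-only null model and equal weights by default. The function names (`marginal_scores`, `burden_statistic`), the weighting scheme and the toy data are illustrative assumptions and are not taken from the methods compared in this report.

```python
def marginal_scores(genotypes, trait):
    """Score U_j = sum_i (y_i - ybar) * g_ij for each variant j."""
    n = len(trait)
    ybar = sum(trait) / n
    n_variants = len(genotypes[0])
    return [
        sum((trait[i] - ybar) * genotypes[i][j] for i in range(n))
        for j in range(n_variants)
    ]


def burden_statistic(genotypes, trait, weights=None):
    """Collapse marginal scores into one weighted-sum score statistic.

    Under the null of no association, the statistic is approximately
    chi-squared with 1 degree of freedom (hypothetical sketch only).
    """
    n = len(trait)
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants  # assumption: equal weights
    ybar = sum(trait) / n

    # Pooled score: weighted sum of the single-variant scores.
    scores = marginal_scores(genotypes, trait)
    pooled = sum(w * u for w, u in zip(weights, scores))

    # Per-individual genetic score and its null variance contribution.
    burden = [sum(w * g for w, g in zip(weights, row)) for row in genotypes]
    bbar = sum(burden) / n
    variance = ybar * (1 - ybar) * sum((b - bbar) ** 2 for b in burden)
    if variance == 0:
        raise ValueError("genetic score or trait has no variation")
    return pooled ** 2 / variance


# Toy data: rows are individuals, columns are rare variants (allele counts).
toy_genotypes = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
toy_trait = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control

pooled_stat = burden_statistic(toy_genotypes, toy_trait)
single_stat = burden_statistic(toy_genotypes, toy_trait, weights=[1.0, 0.0, 0.0])
```

In this toy example each rare variant is carried by a single case, so any one variant alone gives a weak signal (`single_stat` is about 1.2), whereas pooling the three variants gives a much larger statistic (`pooled_stat` is 6.0).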

In practice, true trees are unknown. However, clustering statistics based on true trees represent a best case for detecting association because tree uncertainty is eliminated. Burkett et al. (2014) used known trees to assess the effectiveness of such a tree-based approach for detecting disease-risk variants in a haploid population. They found that clustering statistics computed on the known trees outperform popular methods for detecting causal rare variants in a candidate genomic region. Following Burkett et al., we use Mantel tests as the clustering statistics based on true trees. These tree-based statistics, which rely on known trees, serve as benchmarks against which to compare the popular association methods. However, unlike Burkett et al., who focus on detection of disease-risk variants, we focus here on localization of the association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model.
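As an illustration of how a Mantel test can serve as a clustering statistic, the sketch below correlates the entries of two pairwise distance matrices and obtains a one-sided permutation p-value by relabelling the rows and columns of one matrix. The choice of distances (a toy tree distance and the absolute difference in case-control status), the function names and the number of permutations are assumptions made for illustration; they are not necessarily the definitions used by Burkett et al. (2014) or in this report.

```python
import math
import random


def upper_triangle(matrix, order):
    """Upper-triangular entries of a distance matrix after relabelling."""
    n = len(order)
    return [
        matrix[order[i]][order[j]]
        for i in range(n)
        for j in range(i + 1, n)
    ]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a distance matrix has no variation")
    return sxy / math.sqrt(sxx * syy)


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (Mantel correlation, one-sided permutation p-value)."""
    n = len(dist_a)
    identity = list(range(n))
    a = upper_triangle(dist_a, identity)
    observed = pearson(a, upper_triangle(dist_b, identity))

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the units of the second matrix only
        if pearson(a, upper_triangle(dist_b, perm)) >= observed:
            hits += 1
    # Count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: two clades of three sequences; cases fall in one clade.
clade = [0, 0, 0, 1, 1, 1]
status = [1, 1, 1, 0, 0, 0]
tree_dist = [[0 if i == j else (1 if clade[i] == clade[j] else 4)
              for j in range(6)] for i in range(6)]
trait_dist = [[abs(status[i] - status[j]) for j in range(6)] for i in range(6)]

correlation, p_value = mantel_test(tree_dist, trait_dist)
```

Here the trait clusters perfectly with the two clades, so the Mantel correlation is 1 (up to rounding). With only six units, few relabellings are possible, so the permutation p-value cannot become very small in this toy setting.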

In this report, we compare the performance of selected association methods for fine-mapping a disease locus in the middle of a larger candidate genomic region. In our simulation study, we use variant data generated under the coalescent model. To illustrate ideas, we start by working through a particular example dataset as a case study to gain insight into the association methods. We then perform a simulation study involving 200 sequencing datasets and score which association method localizes the signal best overall. Our results indicate the potential of ancestral tree-based approaches for localizing the association signal.