ROUGH DRAFT authorea.com/80945

    Outline

    1. Introduction

      1. Area of study

        • Gene genealogy describes the relationship between individual genes sampled from a general population.

        • It has the potential to help identify genetic variants that contribute to a specific disease.

        • Identifying disease-causing genetic variants may contribute to the development of preventive and disease-modifying therapies.

        • To identify disease-causing genetic variants, we can use genetic association studies with a case-control design.

        • About the trees underlying the sequence data (mutations occur on the branches of the tree).

          • The tree at each locus in the genomic region represents the genealogy of the sampled haplotypes.

          • Loci between recombination events have the same ancestral tree.

          • Tree-based approaches are powerful for detecting association with disease. (Bardel 2005)

      2. Brief literature review:

        • Rare genetic variants can play major roles in influencing complex traits. (Pritchard 2001, Schork 2009)

        • The rare susceptibility variants identified through sequencing have the potential to explain some of the ‘missing heritability’ of complex traits. (Eichler 2010)

        • However, standard methods to test for association with single genetic variants are underpowered for rare variants unless sample sizes are very large. (Li 2008)

        • Overview of 3 types of analysis methods (Besides single-variant approach)

          • Pooled-variant methods combine the association information across multiple variant sites within a gene. Pooling information in this way can enrich the association signal. (Lee 2014)

          • Joint-modeling methods identify the joint effect of multiple genetic variants simultaneously (Cho 2010). This may be a more powerful approach than pooling marginal associations across variants when trait-influencing variants are in low linkage disequilibrium.

          • Individuals carrying the same disease-predisposing variant are likely to inherit it from the same ancestor. Therefore, cases will tend to cluster together in the underlying genealogy. A method to detect clustering of the cases on the tree represents an alternative grouping method based on relatedness (Burkett 2013).

      3. Purpose of the study

        • To investigate the ability of several association methods to fine-map trait-influencing, causal variants within a 2Mb candidate genomic region.

        • We use variant data simulated from coalescent trees. Our work extends that of Burkett et al., which investigated the ability to detect association signal in the candidate region without regard to localization.

        • Work through a particular example as a case study for insight into several popular methods for association mapping.

        • Simulate 200 datasets and score which method localizes best, overall.

      4. Make the point here in the intro that we’ve included true trees in the comparison, even though we won’t know them in practice, because in principle they should give the best result.

        • Genealogical tree represents the ancestry of the sample at each locus in the region.

        • Individuals carrying the same disease risk alleles tend to cluster on a tree at the disease risk locus.

        • In practice, true trees are unknown. However, cluster statistics based on true trees represent a best case for detecting association, as tree uncertainty is eliminated.

        • Burkett et al. established the optimality of these tree tests for detecting association. We therefore used clustering tests based on true trees as a benchmark.

    2. Methods

      1. Data simulation

        1. Simulating the population

          • fastsimcoal2 (Excoffier 2013)

            • Simulate 3000 haplotypes of 4000 equispaced SNVs in a 2-Mbp region.

            • Recombination rate \(= 1 \times 10^{-8}\) per bp per generation.

            • Population effective size, \(N_{e} = 6200\) (Tenesa 2007)

            • Randomly pair the 3000 haplotypes into 1500 diploid individuals.

          • Logistic regression model of disease status.

            • Assign disease status to the 1500 individuals based on randomly sampled risk SNVs from the mid-region (950 kbp to 1050 kbp) and a diploid model of disease risk.

            • For risk SNVs, the number of copies of the derived allele increases disease risk according to a logistic regression model, \[\mathrm{logit}\{P(D=1 \mid G)\} = \mathrm{logit}(0.2) + \sum_{j=1}^{m} 2 \times G_j,\] where:

            • \(D\) is disease status (\(D = 1\), case; \(D=0\), control).

            • \(G=(G_1, G_2, \ldots , G_{m})\) is an individual’s multi-locus genotype at \(m\) risk SNVs, with \(G_j\) being the number of copies of the derived allele at the \(j^{th}\) risk SNV.

            • We select the intercept term to ensure that the probability of sporadic disease is approximately \(20\%\).

          • We obtain \(16\) risk SNVs by adding randomly sampled SNVs from the mid-region one at a time, until the prevalence is between \(9.5\%\) and \(10.5\%\) in the \(1500\) individuals.
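
The disease model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy genotypes and the random seed are hypothetical, and the defaults `baseline=0.2` and `effect=2.0` simply mirror the intercept and coefficient in the formula as written.

```python
import math
import random

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def disease_probability(genotype, baseline=0.2, effect=2.0):
    """P(D=1 | G) under logit{P} = logit(baseline) + sum_j effect * G_j."""
    return expit(logit(baseline) + effect * sum(genotype))

def assign_status(genotypes, rng, baseline=0.2, effect=2.0):
    """Draw a 0/1 disease status for each multi-locus genotype."""
    return [1 if rng.random() < disease_probability(g, baseline, effect) else 0
            for g in genotypes]

rng = random.Random(1)  # arbitrary seed, for reproducibility only
# Toy genotypes at m = 3 hypothetical risk SNVs (copies of the derived allele).
genotypes = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (1, 0, 1)]
status = assign_status(genotypes, rng)
```

Under this parameterization each additional copy of a derived risk allele multiplies the odds of disease by \(e^{2}\).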

        2. Sampling case-control data

          • Sample \(50\) cases and \(50\) controls from all \(1500\) individuals.

          • 2747 out of 4000 SNVs were polymorphic.

          • 10 out of 16 risk SNVs were polymorphic.
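
A minimal sketch of the case-control sampling step, assuming genotypes are stored as tuples of derived-allele counts (0, 1 or 2) per individual. The helper names and the toy data are hypothetical; a site is treated as polymorphic in the sample when both alleles are present among the sampled individuals.

```python
import random

def sample_case_control(status, n_cases, n_controls, rng):
    """Randomly pick indices of n_cases cases and n_controls controls."""
    cases = [i for i, d in enumerate(status) if d == 1]
    controls = [i for i, d in enumerate(status) if d == 0]
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)

def polymorphic_sites(genotypes, sample_idx):
    """Indices of sites at which both alleles occur in the sampled individuals."""
    keep = []
    for j in range(len(genotypes[0])):
        derived = sum(genotypes[i][j] for i in sample_idx)
        if 0 < derived < 2 * len(sample_idx):
            keep.append(j)
    return keep

rng = random.Random(2)  # arbitrary seed
# Toy population: 6 individuals, 3 sites; the third site is monomorphic.
status = [1, 0, 0, 1, 0, 0]
genotypes = [(1, 0, 0), (0, 0, 0), (0, 1, 0), (2, 0, 0), (0, 0, 0), (0, 2, 0)]
cases, controls = sample_case_control(status, 2, 2, rng)
poly = polymorphic_sites(genotypes, cases + controls)
```

In the study itself the same idea would be applied with 50 cases and 50 controls drawn from the 1500 simulated individuals.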

      2. Several popular methods

        • Summary paragraph giving an overview of the different types of methods and the ideas motivating them.

        1. Single-variant approach

          • Fisher’s exact test

            • Each variant site in the case-control sample is tested for association with the disease outcome.

            • A \( 2\times 3 \) table is constructed to compare genotype frequencies at each variant site between cases and controls.

              • Rows are disease status of individuals, and columns correspond to three possible genotypes.

            • Recommended when the cell counts are small, as is expected for rare variants.

          • Single-variant tests are less powerful for rare variants (Asimit 2010)
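
To make the \( 2\times 3 \) test concrete, here is a hedged sketch of an exact test that enumerates every table with the observed margins and sums the probabilities of those no more likely than the observed table (the usual extension of Fisher's test beyond \( 2\times 2 \)). The function name and the example counts are hypothetical, and this is not necessarily the implementation used in the study.

```python
from math import comb

def fisher_exact_2x3(table):
    """Exact p-value for a 2x3 table [[cases], [controls]] of genotype counts."""
    (a0, a1, a2), (b0, b1, b2) = table
    cols = (a0 + b0, a1 + b1, a2 + b2)   # genotype (column) totals
    n_cases = a0 + a1 + a2               # first-row total
    denom = comb(sum(cols), n_cases)

    def prob(x0, x1, x2):
        # Multivariate hypergeometric probability of the first row.
        return comb(cols[0], x0) * comb(cols[1], x1) * comb(cols[2], x2) / denom

    p_obs = prob(a0, a1, a2)
    p_value = 0.0
    for x0 in range(min(cols[0], n_cases) + 1):
        for x1 in range(min(cols[1], n_cases - x0) + 1):
            x2 = n_cases - x0 - x1
            if x2 > cols[2]:
                continue
            p = prob(x0, x1, x2)
            if p <= p_obs * (1 + 1e-9):  # small tolerance for float ties
                p_value += p
    return min(p_value, 1.0)

# Hypothetical counts: columns are 0, 1 or 2 copies of the minor allele.
p = fisher_exact_2x3([[40, 8, 2], [48, 2, 0]])
```

Full enumeration is cheap for a sample of 100 individuals, which is one reason an exact test is practical here.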

        2. Pooled-variant method

          • Variable-threshold (VT) test (Price 2010): Assumes that variants with minor allele frequency (MAF) below some threshold are more likely to be functional than variants with higher MAF.

            • For each possible MAF threshold, a genotype score is computed based on a given collapsing scheme. The chosen MAF threshold maximizes the association signal, and permutation testing is used to adjust for the multiple thresholds.

              • Based on collapsing variants into two categories: variants with MAF below the threshold, and variants with MAF above the threshold.

              • Suitable for effects in one direction.

              • Price et al. 2010 found that the VT approach had high power to detect association between rare variants and disease traits in their simulations.

            • We used the VTWOD function in the RVtests R package (Xu 2012).

          • C-alpha (Neale 2011): Tests the variance of the effect sizes of variants in a specific genomic window (each variant may have no effect, increase risk, or decrease risk).

            • Sensitive to risk and protective variants in the same gene.

            • Powerful when the effects are in different directions.

            • R package: SKAT
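
The two pooled statistics can be sketched as follows. This is a simplified illustration, not the RVtests or SKAT implementation: genotypes are assumed to be coded as copies of the minor allele, the VT score is a simplified z-like burden score maximized over observed MAF thresholds, and the C-alpha function returns only the unstandardized statistic \( \sum_j [(y_j - n_j p_0)^2 - n_j p_0 (1 - p_0)] \), where \(y_j\) is the number of copies of variant \(j\) seen in cases, \(n_j\) the number seen overall, and \(p_0\) the proportion of cases. All names and data are hypothetical.

```python
from math import sqrt

def c_alpha_statistic(genotypes, status):
    """Unstandardized C-alpha statistic summed over variants."""
    p0 = sum(status) / len(status)
    t = 0.0
    for j in range(len(genotypes[0])):
        n_j = sum(g[j] for g in genotypes)
        y_j = sum(g[j] for g, d in zip(genotypes, status) if d == 1)
        t += (y_j - n_j * p0) ** 2 - n_j * p0 * (1 - p0)
    return t

def vt_statistic(genotypes, status):
    """Return (max z-like score, MAF threshold that attains it)."""
    n_ind, n_sites = len(status), len(genotypes[0])
    mean_d = sum(status) / n_ind
    maf = [sum(g[j] for g in genotypes) / (2 * n_ind) for j in range(n_sites)]
    best = None
    for thr in sorted(set(m for m in maf if m > 0)):
        keep = [j for j in range(n_sites) if 0 < maf[j] <= thr]
        burden = [sum(g[j] for j in keep) for g in genotypes]
        z = (sum(b * (d - mean_d) for b, d in zip(burden, status))
             / sqrt(sum(b * b for b in burden)))
        if best is None or z > best[0]:
            best = (z, thr)
    return best

# Toy data: 3 cases, 3 controls, 3 variant sites.
genotypes = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 0, 0), (0, 0, 1), (0, 0, 0)]
status = [1, 1, 1, 0, 0, 0]
t_calpha = c_alpha_statistic(genotypes, status)
z_vt, thr_vt = vt_statistic(genotypes, status)
```

Because the VT score is maximized over thresholds, its significance would still need to be assessed by permuting case-control labels.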

        3. Joint-modeling method

          • CAVIARBF (Chen 2015): A fine-mapping method that uses marginal test statistics for the SNVs and their pairwise associations. Approximates the Bayesian multivariate regression implemented in BIMBAM (Servin 2007).

            • To compute the probability that an SNV is causal, a set of candidate models and their Bayes factors must be considered. Let \(p\) be the total number of SNVs in a candidate region; the number of possible causal models is then \(2^p\). Because evaluating all models is infeasible for large \(p\), the method limits the number of causal variants allowed in a model. This restriction reduces the number of models to evaluate to \( \sum_{i=0}^{L} \dbinom{p}{i} \), where \(L\) is the maximum number of causal SNVs in a model. Since there are 2747 SNVs in our data, we set \(L=2\) to keep the computational load down.

          • Elastic-net (Zou 2005): A hybrid regularization and variable selection method that linearly combines the L1 and L2 regularization penalties of the Lasso (Tibshirani 2011) and Ridge (Cessie 1992) methods in multivariate regression. We consider only main effects for SNVs in our elastic-net models.

            • Particularly useful when the number of predictors exceeds the number of observations.

            • We select phenotype-associated SNVs via elastic-net regularization from 100 bootstrap samples.

            • We performed the analysis with the R package glmnet (Friedman 2010, Simon 2011, Tibshirani 2011a).
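
Returning to the CAVIARBF model space, the size of the restricted space \( \sum_{i=0}^{L} \binom{p}{i} \) is easy to check numerically. The helper name below is hypothetical; the values \(p = 2747\) and \(L = 2\) are those stated above.

```python
from math import comb

def n_causal_models(p, max_causal):
    """Number of models with at most max_causal causal SNVs out of p."""
    return sum(comb(p, i) for i in range(max_causal + 1))

n_models = n_causal_models(2747, 2)  # 1 + 2747 + 3771631 = 3774379
```

About 3.8 million models is tractable, whereas the unrestricted space of \(2^{2747}\) models is not.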

        4. Tree-Based method

          • Reconstructed genealogical trees at each SNV (Blossoc, Mailund et al. 2006): A fast method to localize the disease-causing variants.

            • Approximates perfect phylogenies for each site, assuming an infinite-sites model of mutation, and scores them according to the non-random clustering of affected individuals.

            • Mailund et al. 2006 found Blossoc to be a fast and accurate method to localize common disease-causing variants, but how well does it work with rare variants?

            • Can use either phased or unphased genotype data. However, it is impractical to apply it to unphased data with more than a few SNPs due to the computational burden associated with phasing. We will therefore assume the SNV data are phased, as might be done in advance with a fast-phasing algorithm such as fastPHASE (Scheet 2006), BEAGLE (Browning 2011), IMPUTE2 (Howie 2009) or MACH (Li 2010, Li 2009).

          • True trees (MT-rank of the coalescent events, Burkett et al. 2013): Detect co-clustering of the disease trait and variants on genealogical trees.

            • In practice, the true trees are unknown. However, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated. A previous simulation study (Burkett 2013) established the optimality of these tests for detecting association. We therefore include two versions of the Mantel test as a benchmark for comparison.

              • Version 1: Naive-Mantel test, in which the phenotype is scored according to whether or not the haplotype comes from a case.

              • Version 2: Informed-Mantel test, in which the phenotype is scored according to whether or not the haplotype comes from a case and carries a risk variant.

            • Upweight the short branches at the tip of the tree. We assign a branch-length of one to all branches, even the relatively longer branches that are close to the time to the most recent common ancestor.

            • Success in localization was declared if the strongest signal was in the risk region.
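
The general idea of a Mantel-type clustering test on a tree can be sketched as below. This is a generic illustration under loud assumptions, not the statistic of Burkett et al.: the pairwise tree distances are a made-up matrix, phenotype similarity between two haplotypes is taken to be the product of their 0/1 scores, and a small statistic (scored haplotypes lying close together on the tree) is treated as evidence of clustering.

```python
import random

def mantel_statistic(tree_dist, pheno):
    """Sum over haplotype pairs of tree distance times phenotype similarity."""
    n = len(pheno)
    return sum(tree_dist[i][j] * pheno[i] * pheno[j]
               for i in range(n) for j in range(i + 1, n))

def mantel_test(tree_dist, pheno, n_perm, rng):
    """Lower-tail permutation p-value obtained by shuffling phenotype scores."""
    observed = mantel_statistic(tree_dist, pheno)
    hits = 0
    for _ in range(n_perm):
        shuffled = pheno[:]
        rng.shuffle(shuffled)
        if mantel_statistic(tree_dist, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical distances on a 4-tip tree: tips 0,1 and tips 2,3 are siblings.
tree_dist = [[0, 1, 3, 3],
             [1, 0, 3, 3],
             [3, 3, 0, 1],
             [3, 3, 1, 0]]
from_case = [True, True, False, False]
carries_risk = [True, False, False, False]
naive = [1 if c else 0 for c in from_case]
informed = [1 if c and r else 0 for c, r in zip(from_case, carries_risk)]
observed, p_value = mantel_test(tree_dist, naive, 199, random.Random(3))
```

The `naive` and `informed` lists show how the two phenotype scorings differ; only the scoring changes between the two versions, not the test itself.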

        5. Paragraph to discuss how we scored localization and signal detection for each of these methods.

          • Localization: scoring the distance of the peak signal from the risk region based on the average distance across the entire genomic region.

          • Detection: For a given simulated dataset and a given method, we used the maximum statistic across all SNVs as our global test statistic and determined its null distribution by permuting the case-control labels.
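
The detection procedure (maximum statistic with a permutation null) can be sketched generically. The per-SNV statistic used here, an absolute difference in carrier counts, is a hypothetical stand-in for whichever method is being evaluated, and the number of permutations and the seed are arbitrary.

```python
import random

def carrier_difference(genotypes, status):
    """Hypothetical per-SNV statistic: |carriers in cases - carriers in controls|."""
    stats = []
    for j in range(len(genotypes[0])):
        case_c = sum(1 for g, d in zip(genotypes, status) if d == 1 and g[j] > 0)
        ctrl_c = sum(1 for g, d in zip(genotypes, status) if d == 0 and g[j] > 0)
        stats.append(abs(case_c - ctrl_c))
    return stats

def max_statistic_pvalue(stat_fn, genotypes, status, n_perm, rng):
    """Global p-value for the maximum of stat_fn across all SNVs."""
    observed = max(stat_fn(genotypes, status))
    exceed = 0
    for _ in range(n_perm):
        permuted = status[:]
        rng.shuffle(permuted)  # permute case-control labels
        if max(stat_fn(genotypes, permuted)) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

genotypes = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
status = [1, 1, 1, 0, 0, 0]
observed, p_value = max_statistic_pvalue(
    carrier_difference, genotypes, status, 199, random.Random(4))
```

Taking the maximum within each permutation accounts for testing many SNVs at once, so the resulting p-value needs no further multiple-testing correction across sites.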

    3. Results

      1. Example dataset

        1. Summary for population and sample

          • Information on the SNVs in general (i.e. even the ones that don’t appear in the study sample), such as:

            • How many rSNVs?

            • What are their positions in the genomic region?

            • What are their MAFs in the population?

            • Do they appear in the study sample and if yes, what is the frequency of their minor allele in the sample of cases and what is the frequency of their minor allele in the controls?

          • Information on the number of recombination breakpoints between the SNVs that appear in the study sample

        2. LD between rSNVs and others for population

        3. Single-variant statistics plot

        4. Compare pooled variant statistics

        5. Joint-modelling statistics

        6. Number of haplotypes carrying rSNVs among cases and among controls (clustered bar chart).

      2. 200 datasets.

        • Localizing the signal: ecdf of avg. distance from the peak

        • Association signal detection: ecdf of p-values

    4. Discussion

      1. Review the purpose of the study and what we did.

        • Through coalescent simulation we have investigated the ability of several popular association methods to fine-map trait-influencing genetic variants.

        • Worked through a particular example data set as a case study for insight into these popular methods and performed a simulation study to score which method localizes the risk region the best.

      2. Evaluate the localization results on the example data set.

        • The C-alpha test, which is based on the genotypes at all loci in the region, and the informed-Mantel test, which is based on known trees, are the only methods that successfully localize the association signal in the test data.

        • The peak signal from all the other methods (Fisher’s test, VT, CAVIARBF, Elastic-net, Blossoc and naive-Mantel) is close to the disease risk region.

        • Even though the effects are in one direction, C-alpha shows a stronger localization signal in the risk region than VT. Our results for localization are consistent with the results of Burkett et al. for detection.

      3. Evaluate the localization results from the simulation study.

        • Not surprisingly, the informed-Mantel test, our benchmark, outperformed all the other methods. However, Blossoc, CAVIARBF, C-alpha and Fisher’s exact test performed comparably well in localizing the signal.

        • Interestingly, both the naive-Mantel and VT tests performed poorly.

      4. Evaluate the association signal detection from the simulation study.

        • The informed-Mantel benchmark test performed extremely well, as expected.

        • C-alpha, CAVIARBF, and Blossoc performed better than VT, Elastic-net and Fisher’s exact test.

        • VT did not perform much better than Fisher’s exact test, similar to the result reported in Burkett et al. for haploid data.

      5. New ways of scoring phenotypes on simulated trees would be another area for research.

      6. Limitations of the study

        • Simple model of disease risk with additive effects and no covariates.

        • We had to limit the number of causal SNVs in the CAVIARBF model to keep the method computationally feasible.

        • Burkett et al. used a haploid model, whereas we used a diploid model.