INTRODUCTION
- This tiny fraction of explained variation motivates the idea of missing heritability. Advances in sequencing technology have shown that the human genome contains more rare variants than originally estimated.
- Hence, investigating the role of rare variants in the development of human disease is a promising and active research area.
- Rare genetic variants, defined as alleles with a derived allele frequency of less than 5% in the population, could play an important role in influencing complex diseases and traits.
- Rare genetic variants are also anticipated to have a larger influence on a trait than common variants.

- Different methods have been introduced for fine mapping rare genetic variants. However, if the sample size is not large enough, these methods lack statistical power.
- For example, the sequence kernel association test (SKAT) has been proposed as a flexible and efficient regression method for testing the association between the rare and common genetic variants in a region and a continuous or binary trait.
- SKAT condenses the rare variants within a genomic region into a single value, which is then used for association testing. However, this test still lacks statistical power for small sample sizes.

- The gene genealogy describes the relationships among the independent sequence haplotypes that have been sampled from the population. The gene genealogy of a sample can be helpful in fine mapping regions containing multiple rare causal variants.
- In the presence of multiple causal rare variants in a genomic region, case haplotypes carrying a causal variant tend to cluster together in a clade on the genealogy. In other words, the case haplotypes form a separate cluster on the genealogy for each distinct causal rare variant.
- Because we expect case haplotypes carrying a causal variant to be more closely related than haplotypes that do not carry the mutation, the carrier case haplotypes may share longer identical-by-descent (IBD) segments of DNA around the causal rare variant.
- Single-variant methods perform association testing at each SNV site and do not take the genealogy of the sample into account. In contrast, tree-based methods use the evolutionary history of the sample. One can, for example, measure the pairwise similarity of these IBD segments between case and control haplotypes.
- For example, Browning and Thompson (2012) introduce a pairwise statistic that measures IBD between pairs of individuals in the sample, to check whether case haplotypes share more IBD segments in the vicinity of a causal rare variant.
- They showed that IBD mapping can detect an association signal with higher power than single-variant association methods. However, if the sample size is small, the power of the test decreases.
- Taking into account the ancestral information in the gene genealogy, Burkett et al. (2014) evaluate, through a simulation study, the performance of tree-based statistics in detecting the association of rare causal variants with disease.
- In their analysis, they use the true genealogy of the simulated sequences to detect association across the region.
- They use a tree-based statistic, based on the scaled distance between case and control haplotypes on the genealogy, to measure association and detect the signal. Their proposed statistic performed better at signal detection than the other statistics evaluated.
- Following Burkett et al., Karunarathna and Graham (2018) investigate localization and detection of the association signal across the region after reclassifying the case haplotypes according to their carrier status.
- They showed that, by reclassifying the case haplotypes based on their true carrier status and using the gene genealogy, their proposed informed Mantel test significantly outperforms the other available methods.

- In this study, we investigate the ability of IBD-based methods to detect and localize disease causal variants that lie in a subregion of a candidate genomic region.
- As the gene genealogy is not known in practice, we estimate it by reconstructing partitions from the sample's sequence data. To reconstruct the partitions, we use the perfectphyloR package.
- We also explore the idea of reclassifying the case haplotypes into carriers versus non-carriers of causal variants, using the idea of genealogical nearest neighbors (GNN) introduced by Kelleher et al. (2019), to improve the performance of these methods.
- Through simulation, we compare the ability of the proposed IBD-based methods with that of two popular association methods to detect the association and to localize causal variants in a 100kb subregion of a 2Mbp genomic region.
- To illustrate the ideas, we start by working through a particular example dataset as a case study. We then perform a simulation study involving 200 datasets to compare the ability of the methods to _detect_ and _localize_ the disease causal region.

1. Area of study
2. Literature review
3. Purpose of this study

- Using the idea of identity-by-descent (IBD), we explore reclassifying the case haplotypes that carry disease causal variants to improve localization (and detection?) of disease causal variants.
- To reclassify the case haplotypes, we use the method based on genealogical nearest neighbors (GNN) described in Kelleher et al. (2019).
- We compare the ability of the proposed GNN method with that of some popular association methods to localize causal variants in a subregion of a 2Mbp genomic region.
- To illustrate the idea, we start by working through a particular example dataset as a case study, for insight into the selected association methods.
- We then perform a simulation study involving 200 datasets to score the ability of our GNN method to localize the disease causal region.
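
The GNN idea mentioned above can be sketched on a single genealogical tree. This is a minimal illustration under our own assumptions, not the procedure of Kelleher et al. (2019), who summarize nearest neighbours across the trees along a genomic region. The function name `gnn_proportions`, the parent-map tree encoding and the toy labels are hypothetical. For a focal haplotype, the sketch climbs to the closest ancestor that subtends at least one other labelled haplotype and reports the proportion of those neighbours in each group.

```python
from collections import Counter, defaultdict


def gnn_proportions(parent, labels, focal):
    """Proportion of each label among the focal leaf's nearest neighbours.

    parent: dict mapping every node to its parent (the root maps to None).
    labels: dict mapping leaf names to a group label (e.g. "case"/"control").
    focal:  the leaf whose neighbours are summarized.
    """
    # Invert the parent map so that we can walk down the tree.
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    def leaves_below(node):
        if not children[node]:
            return [node]
        found = []
        for child in children[node]:
            found.extend(leaves_below(child))
        return found

    # Climb from the focal leaf until an ancestor has other labelled leaves.
    node = parent[focal]
    while node is not None:
        neighbours = [leaf for leaf in leaves_below(node)
                      if leaf != focal and leaf in labels]
        if neighbours:
            counts = Counter(labels[leaf] for leaf in neighbours)
            return {lab: n / len(neighbours) for lab, n in counts.items()}
        node = parent[node]
    return {}


# Toy tree (hypothetical): ((a, b), c) and d joined at the root.
parent = {"a": "n1", "b": "n1", "n1": "n2", "c": "n2",
          "n2": "root", "d": "root", "root": None}
labels = {"a": "case", "b": "control", "c": "case", "d": "control"}
print(gnn_proportions(parent, labels, "c"))
```

In this toy tree, haplotype `c` has nearest neighbours `a` and `b`, so half of its neighbours are cases. A case haplotype whose neighbours are mostly cases would be a candidate carrier under a rule of this kind; the cut-off used to decide is a further assumption that the sketch leaves open.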
ABSTRACT
- Genealogical tree-based methods have potential application in genomic mapping, as sequences with similar trait values may tend to cluster together on a tree at the location of a trait-influencing variant.
- We investigate the utility of tree-based approaches to fine map causal variants in three different projects.
- In the first project, through coalescent simulation, we compare the ability of several popular methods of association mapping to localize causal variants in a sub-region of a candidate genomic region.
- We consider four broad classes of association methods, which we describe as single-variant, pooled-variant, joint-modelling and tree-based, under an additive genetic-risk model.
- Our results lend support to the potential of tree-based methods for genetic fine-mapping of disease.
- We further find that differentiating case sequences by their carrier status can improve fine-mapping ability.
- In the second project, we develop an R package to dynamically cluster a set of single-nucleotide variant (SNV) sequences.
- The resulting reconstructions provide important insight into the local ancestral structure of the sequence data.
- Since the true genealogy is unknown in practice, our package may be useful to researchers seeking insight into the ancestral structure of their sequence data.
- In the third project, we apply the methods developed in the second project to investigate the fine-mapping ability of tree-based methods for rare variants.
- We also pursue the idea of reclassifying case sequences by their carrier status using the idea of genealogical nearest neighbours.
- In this study, we investigate the ability of tree-based methods to fine map rare causal variants and compare it with that of non-tree-based methods.

- Many different tree-based approaches have been developed to detect regions associated with a disease due to a single common variant [e.g. 16–18].
- In general, these approaches construct a tree or a set of trees consistent with the genotype data and use the predicted tree to define clusters.
- Cluster membership is then correlated with disease status. Such cluster-based approaches do not require knowledge of the true disease model (i.e., knowledge of the penetrance values is not required).
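
As a rough illustration of correlating cluster membership with disease status, the sketch below computes a Pearson chi-square statistic for the cluster-by-status table. The choice of statistic, the function name `cluster_chi_square` and the toy data are our assumptions; the cited approaches may score clusters differently. No penetrance values enter the calculation.

```python
from collections import Counter


def cluster_chi_square(clusters, is_case):
    """Pearson chi-square statistic for a cluster-by-disease-status table.

    clusters: cluster label of each sequence (e.g. derived from a tree).
    is_case:  True/1 for case sequences, False/0 for controls.
    Assumes the sample contains at least one case and one control.
    """
    n = len(clusters)
    n_case = sum(is_case)
    sizes = Counter(clusters)
    case_counts = Counter(c for c, d in zip(clusters, is_case) if d)

    stat = 0.0
    for cluster, size in sizes.items():
        cases_here = case_counts[cluster]
        # Compare observed and expected counts for cases, then controls.
        for observed, total in ((cases_here, n_case),
                                (size - cases_here, n - n_case)):
            expected = size * total / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): two clusters that separate cases from controls.
clusters = ["A"] * 4 + ["B"] * 4
is_case = [1, 1, 1, 1, 0, 0, 0, 0]
print(cluster_chi_square(clusters, is_case))
```

In practice, significance would more likely be assessed by permuting disease status than by the chi-square reference distribution, because sequences that share a genealogy are not independent.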
INTRODUCTION
- Identity-by-descent (IBD) is the phenomenon in which two or more individuals share a segment of DNA inherited from a common ancestor.
- IBD mapping is a statistical method for fine mapping disease causal variants that lie on IBD segments shared among unrelated individuals with a disease.
- IBD mapping can be considered a complementary method to genome-wide association studies, as single-marker association studies are underpowered for rare variants.
- In contrast to single-marker association studies, IBD methods are robust to allelic heterogeneity.
- In IBD methods, we test the association between the clustering of DNA sequences and the clustering of trait values.
- The intuition is that individuals carrying the same disease-predisposing alleles (mutations) are likely to share IBD segments.
- On the genealogical tree at a causal variant position, these disease-predisposing alleles tend to cluster together, and their cluster membership is correlated with disease.
- Several studies have been conducted to detect the association of these clustered IBD segments with disease.

- For example, Burkett et al. (2014) have described the utility of several tree-based methods to identify multiple rare variants that contribute to a disease.
- Also, Karunarathna and Graham (2018) have shown the fine-mapping ability of IBD methods compared with several association methods.
- They found that classifying case sequences into carriers and non-carriers of causal variants can improve the fine-mapping ability of IBD methods.
- However, these two studies focused on the setting in which true IBD information is available.

- In addition, Browning and Thompson (2012) have investigated the power of IBD mapping to detect association for complex disease.
- By computing a pairwise statistic that consists of the rates of IBD in case/case and non-case/case pairs of individuals at each SNV position, they found that IBD mapping has higher power than SNV association testing for genome-wide case-control SNV data.
- Throughout this article, we compare the ability of non-IBD and IBD methods to detect and localize disease causal variants that lie in a subregion of a candidate genomic region.
- As non-IBD methods, we consider the standard Fisher’s exact test and a sequence kernel association test known as SKAT-O.
- Since the standard Fisher’s exact test is not powerful for rare-variant association, we consider SKAT-O as a rare-variant association test.
- Rare-variant association tests can be classified into three categories: burden tests (e.g., VT), variance-component tests (e.g., SKAT, C-alpha) and combined tests.
- SKAT-O is a combined rare-variant test that combines the burden test and SKAT.
- Therefore, SKAT-O is more robust than either the burden test or SKAT to the proportion of causal variants in the genomic region and to the direction of their effects on the trait.
- For IBD methods, the gene genealogy is not known in practice, so we do not know the true IBD clusters that could be used for association.
- We therefore estimate the clustering of sequences by using the methods developed in Chapter 3.
- We also explore the idea of reclassifying the case haplotypes into carriers versus non-carriers of causal variants, using the idea of genealogical nearest neighbors (GNN) introduced by Kelleher et al. (2019), to improve the performance of the IBD method.
- Through coalescent simulation, we compare the ability of the proposed IBD-based methods with that of two popular (non-IBD) association methods to detect the association and to localize causal variants in a 100kb subregion of a 2Mbp genomic region.
- To illustrate the ideas, we start by working through a particular example dataset as a case study.
We then perform a simulation study involving 200 datasets to compare the ability of the methods to _detect_ and _localize_ the disease causal region.
- A number of studies have been done, for example: ...
- Overview of some literature: Kelly’s paper, my paper, Browning and Thompson, etc.

- Throughout this article, we compare the ability of non-IBD and IBD methods to fine map the rare causal variants.
- As IBD methods,
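
The pairwise statistic described above, built from the rates of IBD in case/case and non-case/case pairs, can be sketched as a simple difference in rates at one position. This is an illustrative simplification under our own assumptions and not the exact statistic of Browning and Thompson (2012). The segment format `(i, j, start, end)` and the name `ibd_rate_difference` are hypothetical.

```python
from itertools import combinations


def ibd_rate_difference(segments, is_case, position):
    """IBD rate among case/case pairs minus the rate among all other pairs.

    segments: iterable of (i, j, start, end) tuples, meaning individuals
              i and j are IBD on the half-open interval [start, end).
    is_case:  list of booleans, one per individual.
    position: genomic position at which the statistic is evaluated.
    """
    # Pairs that are IBD at the requested position.
    ibd_pairs = {frozenset((i, j)) for i, j, start, end in segments
                 if start <= position < end}

    cc_total = cc_ibd = other_total = other_ibd = 0
    for i, j in combinations(range(len(is_case)), 2):
        shared = frozenset((i, j)) in ibd_pairs
        if is_case[i] and is_case[j]:
            cc_total += 1
            cc_ibd += shared
        else:
            other_total += 1
            other_ibd += shared

    cc_rate = cc_ibd / cc_total if cc_total else 0.0
    other_rate = other_ibd / other_total if other_total else 0.0
    return cc_rate - other_rate


# Toy data (hypothetical): three cases (0, 1, 2) and one control (3).
is_case = [True, True, True, False]
segments = [(0, 1, 100, 200), (1, 2, 150, 300), (2, 3, 0, 50)]
print(ibd_rate_difference(segments, is_case, 160))
```

A positive value at a position indicates excess IBD sharing among cases there. Scanning the statistic across SNV positions, with permutation of case status for significance, is one way such a statistic could be used for both detection and localization.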
Paragraph 2:
- In this thesis, we explore the fine-mapping ability of genealogical tree approaches in three different projects.
- In Chapter 2, we compare the ability of several popular association methods to fine map a disease causal region, using the true genealogical trees as a reference.
- This chapter was published in the journal Human Heredity in 2018.
- Chapter 3 implements a method to reconstruct partitions of the underlying genealogical tree from SNV haplotype data.
- This chapter was published in the journal BMC Bioinformatics in 2019.
- This chapter includes a simple example of grouping haplotypes into nested clades for use in association mapping.
- Chapter 4 applies the methods developed in Chapter 3 to the problem of _detecting_ and _localizing_ the disease-causal genomic region and introduces methods to reclassify the case haplotypes based on their estimated carrier status for a causal SNV.
- We next describe each of the chapters in more detail.
- COALESCENT TREE
- DIPLOID A species that has paired chromosomes, one inherited from each of two parents.
- FASTSIMCOAL2 A C++ program to simulate genetic markers under complex evolutionary models.
- FINE MAPPING Determining the genetic variant (or variants) in a genomic region that contribute to a complex trait.
- HAPLOID A species having a single set of chromosomes, inherited from a single parent.
- HAPLOTYPE Chromosome segment with genetic variation on it.
- MSPRIME A Python program to generate coalescent trees for a sample under a range of evolutionary scenarios.
- PHASING Process of assigning alleles to the paternal and maternal chromosomes to obtain a pair of haplotypes.
- RECOMBINATION A process by which pieces of DNA are broken and recombined to produce new combinations of alleles.
- SINGLE NUCLEOTIDE VARIANT Variation at a given site along the DNA sequence.
- IDENTITY BY DESCENT The phenomenon of two or more nucleotide sequences being inherited from a common ancestor.
- In Chapter 4, we explore the ideas of reconstructing genealogical partitions and of reclassifying the case haplotypes to improve localization and detection of disease causal variants. To reconstruct the partitions, we apply the methods developed in Chapter 3 for fine mapping.
- With the simulated case-control haplotype data, we use the concepts of gene identity-by-descent (IBD) and SNV association to reclassify case haplotypes into carriers and non-carriers of causal SNVs.
- For IBD-based reclassification, we first reconstruct the partitions underlying the sample haplotype data using the method developed in Chapter 3. We then compute a pairwise IBD statistic for case and control haplotypes to reclassify the case haplotypes.
- For reclassification based on SNV association, we use the number of positively associated alleles to reclassify case haplotypes into carrier and non-carrier status.
- We use these reclassification methods to localize and detect the disease causal variants in a sub-region of a candidate genomic region.
- We first work through an example dataset for insight into these reclassification methods.
- Using a simulation study, we then compare the localization and detection ability of our approaches with that of some popular association methods.
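
The SNV-association-based reclassification described above can be sketched as follows. The details are our assumptions, since the text does not specify them. Here an allele counts as positively associated when its frequency is strictly higher in cases than in controls, and a case haplotype is called a carrier when it holds at least `min_alleles` such alleles. The function name `reclassify_cases` and the toy data are hypothetical.

```python
def reclassify_cases(haplotypes, is_case, min_alleles=1):
    """Label each case haplotype as a putative carrier (True) or not (False).

    haplotypes:  list of 0/1 lists, one per haplotype (1 = derived allele).
    is_case:     list of booleans, one per haplotype.
    min_alleles: assumed cut-off on the number of positively associated
                 alleles needed to call a case haplotype a carrier.
    """
    cases = [h for h, c in zip(haplotypes, is_case) if c]
    controls = [h for h, c in zip(haplotypes, is_case) if not c]

    # SNVs whose derived allele is more frequent in cases than in controls.
    positive = [
        k for k in range(len(haplotypes[0]))
        if sum(h[k] for h in cases) / len(cases)
        > sum(h[k] for h in controls) / len(controls)
    ]

    # Count positively associated alleles carried by each case haplotype.
    return [sum(h[k] for k in positive) >= min_alleles for h in cases]


# Toy data (hypothetical): three case and three control haplotypes, 3 SNVs.
haplotypes = [
    [1, 0, 1], [1, 0, 0], [0, 0, 0],   # cases
    [0, 1, 0], [0, 1, 1], [0, 0, 0],   # controls
]
is_case = [True, True, True, False, False, False]
print(reclassify_cases(haplotypes, is_case))
```

Only the first SNV is positively associated in this toy sample, so the first two case haplotypes are labelled carriers and the third is not. A formal test at each SNV, such as Fisher’s exact test, could replace the simple frequency comparison used here.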
A perfect phylogeny is a rooted binary tree that recursively partitions DNA sequences. The nested partition structures of a perfect phylogeny provide insight into the pattern of ancestry of DNA sequence data. For example, disease sequences may cluster together in a local partition indicating that they arise from a common ancestral haplotype. The availability of an R package that reconstructs perfect phylogenies should therefore be useful to researchers seeking insight into the ancestral structure of their sequence data. We develop an R package perfectphyloR to reconstruct the local perfect phylogenies underlying a sample of DNA sequences. Our implementation first partitions the DNA sequences using a classic partitioning algorithm and then uses well known heuristics to refine them further. We here briefly demonstrate the reconstruction process and illustrate the major functionality of the package using worked examples.
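
To make the notion of a recursive partition concrete, the sketch below splits binary SNV sequences into nested clusters. It is a simplified illustration under our own assumptions and is not the algorithm implemented in perfectphyloR. At each step it splits the current cluster on the SNV with the most derived-allele carriers, and it assumes the data are compatible with a perfect phylogeny. The function name `partition` and the toy sequences are hypothetical.

```python
def partition(sequences, members=None, snvs=None):
    """Recursively split haplotypes into nested clusters.

    sequences: list of 0/1 lists, one per haplotype (1 = derived allele).
    Returns a list of haplotype indices for a leaf cluster, or a
    (carriers, non_carriers) tuple of sub-partitions for an internal split.
    """
    if members is None:
        members = list(range(len(sequences)))
        snvs = list(range(len(sequences[0])))

    def carriers_of(k):
        return sum(sequences[i][k] for i in members)

    # Keep only SNVs that still vary within the current cluster.
    informative = [k for k in snvs if 0 < carriers_of(k) < len(members)]
    if not informative:
        return members  # sequences in this cluster cannot be separated

    # Split on the SNV carried by the most haplotypes in the cluster.
    best = max(informative, key=carriers_of)
    carriers = [i for i in members if sequences[i][best] == 1]
    others = [i for i in members if sequences[i][best] == 0]
    rest = [k for k in informative if k != best]
    return (partition(sequences, carriers, rest),
            partition(sequences, others, rest))


# Toy sequences (hypothetical), compatible with a perfect phylogeny.
sequences = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
print(partition(sequences))
```

In the toy data, haplotypes 0 and 1 end up in the same innermost cluster because their sequences are identical. Real sequence data with recombination will generally not fit a single perfect phylogeny across a region, which is why a local reconstruction around a focal SNV is of interest.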