Inferring genome-wide variation in mutation and selection from polymorphism and divergence data



Write abstract here .

Author summary for PLOS Genetics


Despite half a century spent scrutinizing levels of naturally occurring polymorphism and, more recently, levels of between-species divergence at the molecular level, we know embarrassingly little about which forces explain these patterns (see a recent review (Leffler ) for an overview of the “old riddle”). Some recent analyses of large comparative datasets nevertheless suggest that some life-history traits may have a pervasive effect on the amount of neutral nucleotide diversity at a very large phylogenetic scale. What we do know from previous studies is that:

  • Mutation rates vary substantially throughout the genome, and between-species divergence at presumably neutrally evolving regions or sites can reveal that variation;

  • The amount of drift experienced by a particular locus may also vary because of selection at linked sites. There is mounting evidence that this phenomenon is widespread, although the effect of positive and/or negative selection at linked sites has rarely been quantified;

  • Last, the amount of apparent molecular adaptation varies substantially from gene to gene. There is also substantial variation from site to site in the amount of selection experienced.

To make progress on this question, we propose a statistical framework to infer some of the key evolutionary parameters that drive the joint patterns of polymorphism (within a species) and divergence (between a pair of species). The model is parametrized so as to jointly estimate two evolutionary factors that are central to many population genetics problems but are notoriously difficult to estimate from data:

  • Mutation rates and their variation across the genome;

  • The distribution of fitness effects of new mutations.

Our approach also accounts for demography by jointly estimating nuisance parameters that capture the overall effect of the past (and generally unknown) demographic history of the sample on the expected neutral site frequency spectrum (SFS).

Model and Methods

Notations and assumptions

We envision that data have been gathered on patterns of polymorphism within a species for a set of genome fragments, and possibly that divergence was also scored on the same fragments by sequencing an outgroup. We also assume that the ancestral versus derived state of each polymorphism called within the species can be inferred (possibly with some error; see below).

We label the counts of SNPs observed in each frequency category at potentially selected sites and at neutral sites, together with the counts of divergent sites in each of these two classes, as \(D=\{P_{\text{sel}}(i), P_{\text{neut}}(i), D_{\text{sel}}, D_{\text{neut}}\}\) with \( i \in \{1,\ldots,n-1\}\), where \(n\) is the sample size.

Expectations for counts in the site frequency spectrum

Assuming a population at mutation-selection-drift equilibrium, previous theory has shown that, provided SNPs are mutually independent, the counts in the vector \(D\) can be modeled as a series of mutually independent Poisson random variables whose means can be expressed as functions of the mutation, selection, and drift coefficients. Below we express these expectations for each class of polymorphic and divergent sites, first in a Wright-Fisher population at demographic equilibrium; we then relax that assumption.
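The independent-Poisson assumption stated above implies that the log-likelihood of any vector of counts is a sum of Poisson log-probabilities. The following is a minimal Python sketch of that sum, not the authors' implementation; the function name and argument names are hypothetical, and the means are assumed to be strictly positive.

```python
import math

def poisson_log_likelihood(counts, means):
    """Sum of independent Poisson log-probabilities.

    counts: observed non-negative integer counts (e.g. one SFS class each).
    means:  matching expected counts, assumed strictly positive.
    """
    total = 0.0
    for k, mu in zip(counts, means):
        # log Poisson pmf: k*log(mu) - mu - log(k!)
        total += k * math.log(mu) - mu - math.lgamma(k + 1)
    return total

# Hypothetical usage: three frequency classes with illustrative numbers.
example = poisson_log_likelihood([12, 5, 3], [10.0, 5.0, 10.0 / 3.0])
```

Because the classes are treated as independent, the contributions of polymorphism and divergence counts can simply be added in the same way.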

Neutral Sites

In this case the expectations for both polymorphic and divergent sites can be expressed in closed form for a fragment of \(L_{\text{neut}}\) neutral nucleotides: \[E[P_{\text{neut}}(i)]=\frac{\theta L_{\text{neut}}}{i}\]

\[E[D_{\text{neut}}]= L_{\text{neut}}\left(\lambda + \frac{\theta}{n}\right)\]
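As an illustration, the two neutral expectations can be evaluated directly. This is a minimal sketch with hypothetical function names; `lam` stands for \(\lambda\) and is treated here simply as an input, and all numerical values are illustrative.

```python
def expected_neutral_sfs(theta, L_neut, n):
    """Expected neutral SNP counts E[P_neut(i)] = theta * L_neut / i, i = 1..n-1."""
    return [theta * L_neut / i for i in range(1, n)]

def expected_neutral_divergence(theta, lam, L_neut, n):
    """Expected neutral divergence count E[D_neut] = L_neut * (lam + theta / n)."""
    return L_neut * (lam + theta / n)

# Illustrative values only: theta = 0.01 per site, 1000 neutral sites, n = 5.
sfs = expected_neutral_sfs(0.01, 1000, 5)
div = expected_neutral_divergence(0.01, 0.05, 1000, 5)
```

With these inputs the expected counts decay as 1/i across frequency classes, which is the usual signature of the neutral equilibrium spectrum.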

Selected Sites

For selected sites, the expectations for both polymorphic and divergent sites for a fragment of \(L_{\text{sel}}\) nucleotides can be expressed by assuming a given scaled selection coefficient \(S = 4 N_e s\): \[E[P_{\text{sel}}(i)]=\frac{\theta L_{\text{sel}}}{i}\]

\[E[D_{\text{neut}}]= L_{\text{neut}}\left(\lambda + \frac{\theta}{n}\right)\]
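The expressions printed above for selected sites have the neutral form, which corresponds to the limit \(S \to 0\). For \(S \neq 0\), a commonly used Poisson random field expression (an assumption here, since the draft does not state it, and one that presumes genic selection with \(S = 4 N_e s\)) is \[E[P_{\text{sel}}(i)] = \theta L_{\text{sel}} \int_0^1 \binom{n}{i} x^{i}(1-x)^{n-i}\, \frac{1-e^{-S(1-x)}}{x(1-x)\,(1-e^{-S})}\,dx\] The Python sketch below evaluates this integral numerically with Simpson's rule; function names are hypothetical, and very large \(|S|\) may overflow in this simple form.

```python
import math

def selected_sfs_weight(i, n, S, steps=2000):
    """Integral multiplying theta * L_sel for frequency class i (Simpson's rule).

    steps must be even. For S == 0 the neutral limit (1 - x) is used.
    """
    def integrand(x):
        if S == 0.0:
            g = 1.0 - x
        else:
            # (1 - exp(-S(1-x))) / (1 - exp(-S)), written with expm1 for accuracy
            g = math.expm1(-S * (1.0 - x)) / math.expm1(-S)
        # The 1/(x(1-x)) factor is absorbed into the exponents, so the
        # integrand stays bounded on [0, 1] for 1 <= i <= n-1.
        return math.comb(n, i) * x ** (i - 1) * (1.0 - x) ** (n - i - 1) * g

    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0

def expected_selected_sfs(theta, L_sel, n, S):
    """Expected selected SNP counts for classes i = 1..n-1."""
    return [theta * L_sel * selected_sfs_weight(i, n, S) for i in range(1, n)]
```

In this sketch the weight reduces to 1/i when \(S = 0\), so the neutral spectrum is recovered; negative \(S\) lowers every class relative to neutrality and positive \(S\) raises it.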

Incorporating mutational and selection heterogeneity

Add integral over DFE: We assume that all sites are exchangeable within and between fragments with respect to selection. Each site has a scaled selection coefficient \(S=4 N_{\text{e}}s\), and we assume that \(S\) follows an underlying distribution known as the distribution of fitness effects (DFE) of new mutations. We used three different classes of distributions to model the DFE:

  • Under model A, we use a reflected displaced \(\Gamma\) distribution with mean \(\bar{S}\), shape \(\beta\), and maximum \(S_{\text{max}}\);

  • Under model B, a fraction \(p_{\text{b}}\) of mutations are beneficial with a fixed scaled coefficient \(S_{\text{b}}\), while the remaining fraction follows a reflected \(\Gamma\) distribution with mean \(\bar{S}\) and shape \(\beta\) (maximum \(S_{\text{max}}=0\));

  • Under model C, we use the same assumptions as in model B, but the fraction of beneficial mutations is now distributed as a negative exponential with mean \(1/p_{\text{b}}\).
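To make models A and B concrete, the sketch below draws scaled selection coefficients from each. It is only an illustration under one assumed parametrization: the reflected displaced \(\Gamma\) is taken to mean \(S = S_{\text{max}} - X\) with \(X\) Gamma-distributed with shape \(\beta\) and mean \(S_{\text{max}} - \bar{S}\), so that \(E[S] = \bar{S}\). Function names are hypothetical, and model C is omitted because its exact form is not fully specified in the text.

```python
import random

def draw_model_a(rng, mean_S, shape, S_max):
    """Reflected displaced Gamma: S = S_max - X, with X ~ Gamma(shape, scale)."""
    if mean_S >= S_max:
        raise ValueError("mean_S must be below S_max")
    scale = (S_max - mean_S) / shape  # chosen so that E[X] = S_max - mean_S
    return S_max - rng.gammavariate(shape, scale)

def draw_model_b(rng, p_b, S_b, mean_S, shape):
    """With probability p_b return the fixed beneficial coefficient S_b,
    otherwise draw from a reflected Gamma with maximum 0."""
    if rng.random() < p_b:
        return S_b
    return draw_model_a(rng, mean_S, shape, 0.0)

# Illustrative draws with arbitrary parameter values.
rng = random.Random(42)
sample_a = [draw_model_a(rng, -50.0, 2.0, 5.0) for _ in range(5)]
sample_b = [draw_model_b(rng, 0.1, 5.0, -50.0, 2.0) for _ in range(5)]
```

Averaging an expected count over many such draws is one simple Monte Carlo way to approximate the integral over the DFE; a deterministic quadrature would be an alternative.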

Add integral over the distribution of mutation rates. Here we assume that each fragment is exchangeable with respect to mutation and that the scaled per-nucleotide mutation rate \(\theta= 4 N_{e} \mu\) of a given fragment is drawn from an underlying \(\Gamma\) distribution with mean \(\bar{\theta}\) and shape \(\alpha\), which represents mutational heterogeneity. We assume that

Likelihood of the data


Statistical performance of the method

Application example to a real dataset