Model and Methods

Notations and assumptions

We envision that data on patterns of polymorphism within a species have been gathered for a set of genome fragments, and possibly that divergence was also scored on the same fragments by sequencing an outgroup. We also assume that the ancestral versus derived state of polymorphisms called within the species can be inferred (possibly with some error, see below).

We label the counts of SNPs observed in each frequency category at potentially selected sites and at neutral sites, together with the counts of divergent sites in each class, as \(D=\{P_{\text{sel}}(i), P_{\text{neut}}(i), D_{\text{sel}}, D_{\text{neut}}\}\) with \(i \in \{1,\dots,n-1\}\).

Expectations for counts in the site frequency spectrum

Assuming a population at mutation-selection-drift equilibrium, previous theory has shown that, provided SNPs are mutually independent, the counts in vector \(D\) can be modeled as a series of mutually independent Poisson random variables whose means can be expressed as functions of mutation, selection, and drift coefficients. Below we express these expectations for each class of polymorphic and divergent sites, first in a Wright-Fisher population at demographic equilibrium; we then relax that assumption.
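To make the independent-Poisson assumption concrete, the following minimal sketch sums Poisson log-probabilities over a vector of observed counts given their expected values. The function name `poisson_loglik` and its arguments are illustrative choices, not part of the method described here, and all means are assumed to be strictly positive.

```python
import math


def poisson_loglik(counts, means):
    """Sum of independent Poisson log-probabilities.

    counts: observed non-negative integer counts (hypothetical input).
    means:  matching expected counts, each assumed > 0.
    """
    total = 0.0
    for k, mu in zip(counts, means):
        # log Poisson pmf: k*log(mu) - mu - log(k!)
        total += k * math.log(mu) - mu - math.lgamma(k + 1)
    return total
```

Because the counts are treated as independent, the log-probability of the whole vector is simply the sum of the per-entry terms.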

Neutral Sites

In this case the expectations for both polymorphic and divergent sites can be expressed in closed form for a fragment of \(L_{\text{neut}}\) neutral nucleotides: \[E[P_{\text{neut}}(i)]=\frac{\theta L_{\text{neut}}}{i}\]

\[E[D_{\text{neut}}]= L_{\text{neut}}\left(\lambda + \frac{\theta}{n}\right)\]
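The two neutral expectations can be evaluated directly. The sketch below is only an illustration: the function name and argument names are hypothetical, and \(\lambda\) is passed in as `lam` exactly as it appears in the divergence formula.

```python
import numpy as np


def expected_neutral_counts(theta, lam, L_neut, n):
    """Return (E[P_neut(i)] for i = 1..n-1, E[D_neut])."""
    i = np.arange(1, n)                  # frequency categories 1..n-1
    sfs = theta * L_neut / i             # theta * L_neut / i
    div = L_neut * (lam + theta / n)     # L_neut * (lambda + theta / n)
    return sfs, div
```

For instance, `expected_neutral_counts(0.01, 0.05, 1000, 5)` gives an expected singleton count of 10 and an expected divergence count of 52.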

Selected Sites

For selected sites, the expectations for both polymorphic and divergent sites in a fragment of \(L_{\text{sel}}\) nucleotides can be expressed for a given scaled selection coefficient \(S = 4 N_e s\): \[E[P_{\text{sel}}(i)]=\frac{\theta L_{\text{sel}}}{i}\]

\[E[D_{\text{neut}}]= L_{\text{neut}}\left(\lambda + \frac{\theta}{n}\right)\]
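As a hedged illustration of how a dependence on \(S\) can enter such expectations, the sketch below assumes the standard Poisson random field form, which is not stated in the text above. Under that assumption, the density of derived-allele frequency \(x\) is proportional to \(\theta\,(1-e^{-S(1-x)})/[x(1-x)(1-e^{-S})]\), sampling of \(n\) sequences is binomial, and the fixation rate is scaled by \(S/(1-e^{-S})\). The function names and the choice of Gauss-Legendre quadrature are assumptions made for the example. At \(S=0\) the result reduces to the neutral expressions \(\theta L/i\) and \(L(\lambda+\theta/n)\).

```python
import math
import numpy as np

# Gauss-Legendre nodes and weights, mapped from [-1, 1] to [0, 1]
_T, _W = np.polynomial.legendre.leggauss(200)
_X = 0.5 * (_T + 1.0)
_WX = 0.5 * _W


def _g(x, S):
    """(1 - exp(-S(1-x))) / ((1-x)(1 - exp(-S))); equals 1 when S = 0."""
    if S == 0.0:
        return np.ones_like(x)
    if S > 0.0:
        ratio = np.expm1(-S * (1.0 - x)) / np.expm1(-S)
    else:
        a = -S  # algebraically identical form that avoids overflow for S << 0
        ratio = np.exp(-a * x) * np.expm1(-a * (1.0 - x)) / np.expm1(-a)
    return ratio / (1.0 - x)


def expected_selected_counts(theta, lam, S, L_sel, n):
    """Return (E[P_sel(i)] for i = 1..n-1, E[D_sel]) under the assumed form."""
    g = _g(_X, S)
    sfs = np.array([
        theta * L_sel * math.comb(n, i)
        * np.sum(_WX * _X ** (i - 1) * (1.0 - _X) ** (n - i) * g)
        for i in range(1, n)
    ])
    # Relative fixation rate S / (1 - exp(-S)), equal to 1 at S = 0
    fix = 1.0 if S == 0.0 else S / -np.expm1(-S)
    # Sites still polymorphic in the population but fixed in the sample
    misattributed = theta * np.sum(_WX * _X ** (n - 1) * g)
    div = L_sel * (lam * fix + misattributed)
    return sfs, div
```

The quadrature is meant for moderate values of \(|S|\); very strong selection would call for more nodes or a different integration scheme.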

Incorporating mutational and selection heterogeneity

Add integral over DFE: We assume that all sites are exchangeable within and between fragments with respect to selection. Each site has a scaled selection coefficient \(S=4 N_{\text{e}}s\), and we assume that \(S\) follows an underlying distribution, known as the distribution of fitness effects (DFE) of new mutations. We use three classes of distributions to model the DFE. Under model A, we use a reflected displaced \(\Gamma\) distribution with mean \(\bar{S}\), shape \(\beta\), and maximum \(S_{\text{max}}\). Under model B, a fraction \(p_{\text{b}}\) of mutations are beneficial with a fixed scaled coefficient \(S_{\text{b}}\), while the remaining fraction follows a reflected \(\Gamma\) distribution with mean \(\bar{S}\) and shape \(\beta\) (maximum \(S_{\text{max}}=0\)). Under model C, we use the same assumptions as in model B, but the fraction of beneficial mutations is now distributed as a negative exponential with mean \(1/p_{\text{b}}\).
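Models A and B can be illustrated by drawing selection coefficients from them. The sketch below rests on an assumed parameterization that the text does not spell out: a reflected displaced \(\Gamma\) is taken to mean \(S = S_{\text{max}} - G\), with \(G\) gamma-distributed with shape \(\beta\) and mean \(S_{\text{max}} - \bar{S}\). The function names are hypothetical, and model C is left out because its parameterization is not fully specified above.

```python
import numpy as np


def sample_dfe_model_a(rng, mean_S, beta, S_max, size):
    """Model A (assumed form): S = S_max - G, with G ~ Gamma(shape=beta).

    Requires mean_S < S_max so that G has positive mean S_max - mean_S.
    """
    scale = (S_max - mean_S) / beta
    return S_max - rng.gamma(beta, scale, size)


def sample_dfe_model_b(rng, mean_S, beta, p_b, S_b, size):
    """Model B (assumed form): S = S_b with probability p_b, else -Gamma.

    Requires mean_S < 0; the deleterious part is a Gamma reflected at 0.
    """
    deleterious = -rng.gamma(beta, -mean_S / beta, size)
    beneficial = rng.random(size) < p_b
    return np.where(beneficial, S_b, deleterious)
```

A call such as `sample_dfe_model_a(np.random.default_rng(0), -50.0, 0.5, 2.0, 1000)` returns values that never exceed `S_max` and whose average approaches `mean_S` as the sample grows.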

Add integral over the distribution of mutation rates. Here we assume that fragments are exchangeable with respect to mutation and that the scaled per-nucleotide mutation rate \(\theta= 4 N_{e} \mu\) of a given fragment is drawn from an underlying \(\Gamma\) distribution with mean \(\bar{\theta}\) and shape \(\alpha\), which represents mutational heterogeneity. We assume that

Likelihood of the data