authorea.com/48199

Overdispersion control and data simulation with methylKit

Briefly explain DNA Methylation, CpG islands (and maybe mention bisulfite sequencing?)

Explain overdispersion and covariates.

All implementation is based on methylKit (Akalin 2012)

(work in progress)

Multiple new features were integrated into the methylKit R-package: The ability to take into account covariates and overdispersion when calculating differential methylation and the possibility to simulate methylation data for testing and demonstration purposes.

The first new feature - or, more precisely, the first two features - aim to improve the sensitivity of the detection of differentially methylated sites by canceling out sites where the difference is either the result of a covariate and not the primary treatment as well as those sites where differences occur as a result of statistical overdispersion.

Covariates must be supplied by the user and theoretically could encompass metadata such as age, sex, etc. which are generally thought to influence DNA methylation quite significantly (Feng 2014). It follows, that the ability to estimate / eliminate their influence on the calculation of differentially methylated sites (caused by a certain treatment) would be of great value.

This is achieved using two GLMs (more precisely, logistic models): One model is constructed with the treatment as well as the covariates, while a second model is built around only the covariates (if there are other covariates present at all).

If there are covariates present, then the difference of deviances of the covariate-model and the treatment-model are the basis for the F- or Chisq-test.If there are no covariates present, then the difference of the null.deviance and the deviance from the treatment-model are used.

However, before those deviance values are tested, they are corrected for overdispersion, if the user desires this. The basic overdispersion correction works through calculating the pearson residuals of the fitted values from the treatment-model and then calculating from those residuals an overall correction factor \(\phi\): \[\phi = \frac{\sum{r_{i}^2}}{(w - nprm)}\] Where \(r_{i}\) are the pearson residuals, \(w\) is the number of counts for the current site and \(nprm\) is the number of fitted parameters from the treatment-model.

The second new feature that was integrated into methylKit is the possibility to simulate methylation data. A Binomial distribution is used to model the background methylation values for each site. The same background value is used for all samples.

Those fractional background values are then mapped to actual methylation counts. This happens by drawing the counts for each site from a betabinomial distribution that uses the binomial background values as the probability.

All those steps take place for treated and control samples alike and the background probabilities are the same for all samples. The effect of a treatment is added into the treated samples by increasing the background probabilities for a user-specified percentage of the sites. The increase of the background probabilities is also user-specified and can be either a fixed value or a range of percentages, in which case the value for each specific site is chosen randomly.

Furthermore, the user is able to supply a numeric covariate for the data (e.g. the age of the samples), which similarly to the treatment effect, changes the background probabilities for certain sites via the logistic function: \[f(x) = \frac{e^{(\beta_{0} + \beta_{1}x)}}{k+e^{(\beta_{0} + \beta_{1}x)}}\] Where \(x\) is the covariate, while \(\beta_{0}\), \(\beta_{1}\) and \(k\) determine the shape and slope of the logistic curve.

An example of simulated data without covariates is depicted in Fig.[1].

The simulations of BSmooth (Hansen 2012) and the WGBSSuite (Rackham 2015) were used for comparison.

What do we see? Did we expect that?

Simulated data: Chisq-test vs. F-test, overdispersion vs. no overdispersion.

Results of real data: Overdispersion control vs. no overdispersion control.

Comparison to CTCF / H3K9ac markers as biological / functional “control”.

Do we see a peak-loss? If so, what percentage?

## Share on Social Media