
We took the process one step further to account for the effects of variable coverage depth. The heterogeneity of coverage depth results from a combination of biases: sequence composition (repeated regions, composition bias), which heavily affects read alignment, as well as sequencing errors and artifacts. Although quality control of the read library helps reduce these effects, we still observe them to some extent.

To assess how well our model recovers the underlying methylation profile under varying read coverage depth, we included a simulation module that creates artificial methylation patterns, much like the simulation described in the preceding paragraph, except that this time the \(z_{i}\) values are simulated as well. We use two models for this purpose. The first assigns a fixed depth across the genome, which represents the best-case scenario; we vary this value and observe how the accuracy changes. The second is closer to reality: we sample the coverage values from a Poisson distribution with different parameters (Figure \ref{fig:accuraty_var}b).
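
The two coverage models can be illustrated with a minimal Python sketch. It is not the simulation module itself: the function names, the two-block methylation profile, and the binomial read model (methylated reads at a site drawn as Binomial(coverage, level)) are assumptions made for illustration, and the Poisson sampler is Knuth's multiplication method, used here only to keep the example free of external dependencies.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the moderate means typical of coverage depth;
    exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_coverage(n_sites, depth, model, rng):
    """Return one coverage depth per site (hypothetical helper)."""
    if model == "fixed":
        # Best case: every site is covered by exactly `depth` reads.
        return [int(depth)] * n_sites
    if model == "poisson":
        # Closer to reality: depth varies from site to site around `depth`.
        return [poisson_draw(depth, rng) for _ in range(n_sites)]
    raise ValueError(f"unknown coverage model: {model!r}")


def simulate_methylated_reads(levels, coverage, rng):
    """Assumed read model: each read at a site reports methylation
    independently with probability equal to the site's true level."""
    return [
        sum(rng.random() < p for _ in range(n))
        for p, n in zip(levels, coverage)
    ]


rng = random.Random(0)
levels = [0.9] * 500 + [0.1] * 500  # hypothetical two-block profile
for mean_depth in (1, 5, 10, 30):
    cov = simulate_coverage(len(levels), mean_depth, "poisson", rng)
    meth = simulate_methylated_reads(levels, cov, rng)
    uncovered = sum(n == 0 for n in cov)
    print(f"mean depth {mean_depth}: {uncovered} sites without any read")
```

Under the Poisson model a fraction of sites receives no read at all when the mean depth is small, which the fixed-depth model cannot reproduce.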

We apply the same validation strategy to benchmark training and parameter estimation. This time we record, for each coverage depth, the final log-likelihood and the relative entropy between the initial and updated parameters (Figure \ref{fig:accuraty_var}b). Unsurprisingly, the estimation improves as coverage increases.
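
For reference, relative entropy is commonly defined as the Kullback-Leibler divergence. As a sketch, assume that the compared parameter is a discrete probability vector (for example one row of a transition matrix), and write \(\theta^{(0)}\) for its initial value and \(\hat{\theta}\) for its updated value; this notation is introduced here for illustration and is not taken from the model definition. Then
\[
D_{\mathrm{KL}}\left(\hat{\theta} \,\|\, \theta^{(0)}\right) = \sum_{k} \hat{\theta}_{k} \log \frac{\hat{\theta}_{k}}{\theta^{(0)}_{k}} .
\]
This quantity is non-negative and equals zero only when the two vectors coincide. It is not symmetric, so the direction chosen here is itself an assumption.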

Based on the results of these simulations, we conclude that sequencing coverage depth is critical for the model to yield the most accurate inferred profiles. However, the loss at lower depths is not dramatic: estimates remain above 95\% accuracy unless the coverage depth drops to very low values (1 to 5). Although such values may sound extreme, poorly covered genomic regions do occur in practice. This heterogeneity could be accounted for by explicitly modeling the distribution of coverage along the genome, in the spirit of the work of B. Mirauta \cite{Mirauta_2014}, but this would require more sophisticated estimation and smoothing techniques.