2.5.1 epiGBS data processing and filtering
We used the pipeline provided by 71 with a bug-fix modification (https://github.com/MWSchmid/epiGBS_Nov_2017_fixed) to process the sequencing files (sensu 74). First, we created a de novo reference from the demultiplexed and quality trimmed sequencing reads. We mapped the reads to the de novoreference and performed strand-specific nucleotide (single nucleotide polymorphisms, SNPs) and methylation (single methylation polymorphisms, SMPs) variant calling. We filtered the resulting SNP and SMP files as follows: i) we initially removed SNPs and SMPs without a minimum coverage of 3 (i.e. 3 sequencing reads mapping to each locus) and a maximum coverage equal to the 99th percentile of the read coverage distribution, in at least one replicate sample per group ii) we removed samples lacking more than 20% of the SNPs or more than 25% of the SMPs; and iii) after removing these samples from the experimental design, we used the original SNP and SMP dataset and removed SNPs and SMPs without a minimum coverage of 10 and a maximum coverage equal to the 99th percentile of the read coverage distribution in at least one replicate sample per group.
The unfiltered datasets resulting from the epiGBS pipeline consisted of 279,103 SNPs and 22.3 million SMPs across all 35 samples. The first and less strict filtering step resulted in 58,773 SNPs and 290,547 SMPs. During the second step we removed 5 samples with low SNP and SMP coverage which, in general, had a low number of sequencing reads after demultiplexing and quality trimming (Table S1). The last and stricter filtering resulted in 52,513 SNPs and 239,728 SMPs across 30 samples comprising 2-3 replicates per group (Table S1).
We created a final SMP working dataset by removing: i) SMPs with any missing values across all 30 samples to obtain a complete SMP matrix (we discarded 196,039 SMPs); ii) SMPs called on the same cytosine as a SNP (we removed 324 positions); and iii) monomorphic SMPs, i.e. SMPs with a methylation frequency ≤ 5% and ≥ 95% across 95% (i.e. n=28) of the samples. We calculated the methylation level at each SMP in each individual sample as the number of reads mapping to one position with methylation divided by the total number of reads mapping to that position. The final matrix, containing only polymorphic SMPs, consisted of 3,769 cytosines across 30 samples. Similarly, we removed SNPs with any missing values across all 30 samples; the final SNP working dataset consisted of 23,252 SNPs.