2.5.1 epiGBS data processing and filtering
We used the pipeline provided by 71 with a bug-fix
modification
(https://github.com/MWSchmid/epiGBS_Nov_2017_fixed)
to process the sequencing files (sensu 74). First, we
created a de novo reference from the demultiplexed and quality
trimmed sequencing reads. We mapped the reads to the de novoreference and performed strand-specific nucleotide (single nucleotide
polymorphisms, SNPs) and methylation (single methylation polymorphisms,
SMPs) variant calling. We filtered the resulting SNP and SMP files as
follows: i) we initially removed SNPs and SMPs without a minimum
coverage of 3 (i.e. 3 sequencing reads mapping to each locus) and a
maximum coverage equal to the 99th percentile of the
read coverage distribution, in at least one replicate sample per group
ii) we removed samples lacking more than 20% of the SNPs or more than
25% of the SMPs; and iii) after removing these samples from the
experimental design, we used the original SNP and SMP dataset and
removed SNPs and SMPs without a minimum coverage of 10 and a maximum
coverage equal to the 99th percentile of the read
coverage distribution in at least one replicate sample per group.
The unfiltered datasets resulting from the epiGBS pipeline consisted of
279,103 SNPs and 22.3 million SMPs across all 35 samples. The first and
less strict filtering step resulted in 58,773 SNPs and 290,547 SMPs.
During the second step we removed 5 samples with low SNP and SMP
coverage which, in general, had a low number of sequencing reads after
demultiplexing and quality trimming (Table S1). The last and stricter
filtering resulted in 52,513 SNPs and 239,728 SMPs across 30 samples
comprising 2-3 replicates per group (Table S1).
We created a final SMP working dataset by removing: i) SMPs with any
missing values across all 30 samples to obtain a complete SMP matrix (we
discarded 196,039 SMPs); ii) SMPs called on the same cytosine as a SNP
(we removed 324 positions); and iii) monomorphic SMPs, i.e. SMPs with a
methylation frequency ≤ 5% and ≥ 95% across 95% (i.e. n=28) of the
samples. We calculated the methylation level at each SMP in each
individual sample as the number of reads mapping to one position with
methylation divided by the total number of reads mapping to that
position. The final matrix, containing only polymorphic SMPs, consisted
of 3,769 cytosines across 30 samples. Similarly, we removed SNPs with
any missing values across all 30 samples; the final SNP working dataset
consisted of 23,252 SNPs.