Motif frequency and deletion length
The motif analysis was performed to determine the frequencies of a
series of specific DNA sequence motifs around the breakpoints of the
pathogenic deletions. In total, 78 motifs (Table S4) were collected from
previous publications (Abeysinghe et al., 2003; Ball et al., 2005;
Chuzhanova et al., 2009). For each deletion from the HGMD dataset, we
calculated the motif frequency at each location in 1-kb bins centered at
the breakpoints. Each deletion in the HGMD dataset had 100 simulated
deletions in the control0 dataset, for which we also calculated the
frequency of motifs. Considering all motifs together, we compared the
motif frequencies in the vicinity of the breakpoints of the pathogenic
deletions (HGMD-deletion dataset) to the motif frequencies in the
vicinity of breakpoints in deletions from the control0 dataset. We found
that the motif frequencies flanking the pathogenic breakpoints decreased
gradually as the distance to the breakpoint fell from 150 bp, and then rose
to their highest values precisely one base from the breakpoint itself
(Figure S6), which likely reflects the contribution of these motifs to the
formation of the deletions. By contrast, the motif frequencies in the
vicinity of the deletion breakpoints from the control0 dataset were
remarkably similar irrespective of their distances from the breakpoints.
When we considered the frequencies of individual motifs in the vicinity
of breakpoints, the distributions could be classified into four subtypes
(Table S5): “Valleys”, “Peaks”, “M shapes” and “Others” (Figures
S7-S12). In total, 22 motifs were grouped as “Valleys” (Figures S7 and
S8); their frequencies decreased with decreasing distance to the
breakpoints and reached their lowest values at the breakpoints
themselves. Another 28 motifs were grouped as “Peaks” (Figures S9 and
S10); their frequencies increased with decreasing distance to the
breakpoints and reached their highest values precisely at the
breakpoints. A further 14 motifs were grouped as “M shapes” (Figure S11),
characterized by frequencies that followed an “M”-shaped curve. Finally, 11
motifs were grouped as “Others” (Figure S12), characterized by
frequencies that were unrelated to distance from the breakpoints. Many
of these “patterns” are exclusive to the pathogenic deletion dataset
and hence may indicate specific sequence differences between the two
datasets that are functionally relevant and predispose these regions to
instability.
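
The motif counting around a breakpoint described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study: the function names (motif_to_regex, motif_starts, count_in_window) are hypothetical, and the sketch assumes that degenerate motifs are written in IUPAC nucleotide codes, that overlapping matches are counted, and that only the given strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile a degenerate motif; the lookahead allows overlapping matches."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=" + body + ")")


def motif_starts(sequence, motif):
    """Return the 0-based start positions of all matches of the motif."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]


def count_in_window(sequence, motif, breakpoint, width):
    """Count matches lying entirely within a window centered on the breakpoint.

    Assumption: the window spans width // 2 bases on each side of the
    breakpoint index and is clipped at the ends of the sequence.
    """
    half = width // 2
    lo = max(0, breakpoint - half)
    hi = min(len(sequence), breakpoint + half)
    return len(motif_starts(sequence[lo:hi], motif))
```

Under these assumptions, a 1-kb bin corresponds to width=1000 and a 10-bp bin to width=10. Applying count_in_window to a pathogenic breakpoint and to each of its simulated counterparts would give the counts that are compared in the analysis.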
We counted the occurrences of each motif in 10-bp bins centered at each
breakpoint of the HGMD-deletion dataset and at each of the corresponding
100 simulated breakpoints. Then, for each motif, we calculated an
“experience hit” (eH) value at each breakpoint to assess the significance
of the motif relative to the control breakpoints, and averaged these
eH-values over all the deletion breakpoints in the HGMD-deletion dataset.
The eH-value is the number of simulated breakpoints in the control dataset
for which the motif count in the vicinity of the breakpoint was larger than
the motif count in the vicinity of the pathogenic deletion breakpoint,
divided by 100. We found that 23 motifs occurred more frequently
(eH-value < 0.05) in 10-bp bins centered at the breakpoints of
the pathogenic deletion dataset than at the breakpoints from the
simulated dataset (Figure 6A). These motifs were “CTY”,
“RNYNNCNNGYNGKTNYNY”, “GCCCWSSW”, “GCTGGTGG”, “GCWGGWGG”,
“GGAGGTGGGCAGGARG”, “AGAGGTGGGCAGGTGG”,
“GAAAATGAAGCTATTTACCCAGGA”, “TGRRKM”, “CAGR”, “GCS”, “WGGAG”,
“CTGGCG”, “RGAC”, “RAG”, “ACYYMK”, “CCG”, “GTAAGT”,
“CGGCGG”, “TTCTTC”, “CCACCA”, “GCCCCG”, “GGAGAA” (Table 2),
which included four motifs identified by Ball et al. (2005).
A one-sided Fisher’s exact test was used to examine whether the motifs
identified by Ball et al. were overrepresented among the motifs that
occurred more frequently in 10-bp bins centered at the breakpoints of the
pathogenic deletion dataset than at the breakpoints from the simulated
dataset. The result was not significant (OR = 0.35, P-value = 0.055). We
calculated the average frequencies of all 78 motifs in 1-kb bins
centered at the deletion breakpoints to explore the relationship between
motif frequency and deletion length (Figure 6B) and identified six
motifs whose frequencies correlated significantly with deletion
length (PCC > 0.7 and p < 1E-6) (Figure S13).
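
The eH-value defined earlier in this section amounts to an empirical fraction and can be sketched in Python as below. The function names eh_value and mean_eh_value are hypothetical. The sketch assumes that the comparison is strict (ties do not count towards the eH-value) and that the divisor equals the number of simulated breakpoints, which is 100 in the setting described here.

```python
def eh_value(observed_count, simulated_counts):
    """Fraction of simulated breakpoints whose motif count exceeds the observed one.

    Assumption: ties are not counted, following the wording "larger than".
    """
    if not simulated_counts:
        raise ValueError("at least one simulated count is required")
    exceed = sum(1 for count in simulated_counts if count > observed_count)
    return exceed / len(simulated_counts)


def mean_eh_value(observed_counts, simulated_counts_per_breakpoint):
    """Average the eH-values of one motif over all pathogenic breakpoints."""
    if len(observed_counts) != len(simulated_counts_per_breakpoint):
        raise ValueError("each observed count needs its own simulated counts")
    values = [
        eh_value(observed, simulated)
        for observed, simulated in zip(observed_counts, simulated_counts_per_breakpoint)
    ]
    return sum(values) / len(values)
```

For example, if a motif occurs three times near a pathogenic breakpoint and only 4 of the 100 simulated breakpoints show more than three occurrences, eh_value returns 0.04, which falls below the 0.05 threshold used above.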