Motif frequency and deletion length
Motif analysis was performed to determine the frequencies of specific DNA sequence motifs around the breakpoints of the pathogenic deletions. In total, 78 motifs (Table S4) were collected from previous publications (Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et al., 2009). For each deletion in the HGMD dataset, we calculated the motif frequency at each position within 1-kb bins centered at the breakpoints. Each deletion in the HGMD dataset had 100 simulated deletions in the control0 dataset, for which we also calculated motif frequencies. Considering all motifs together, we compared the motif frequencies in the vicinity of the pathogenic deletion breakpoints (HGMD-deletion dataset) with those in the vicinity of the deletion breakpoints from the control0 dataset. The motif frequencies flanking the pathogenic breakpoints decreased gradually from a distance of 150 bp toward the breakpoint, and then attained their highest values precisely one base from the breakpoint itself (Figure S6), likely reflecting the contribution of these motifs to the formation of the deletions. By contrast, the motif frequencies in the vicinity of the deletion breakpoints from the control0 dataset were remarkably similar irrespective of distance from the breakpoint.
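The position-wise counting described above can be illustrated with a minimal sketch. It assumes that motifs are written in IUPAC degenerate nucleotide codes (as in “CTY” or “TGRRKM”), that only the forward strand is scanned, and that overlapping occurrences are counted separately; none of these details is stated in the text. The function names and the `flank` parameter are hypothetical and do not correspond to the pipeline actually used.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile a degenerate motif into a regex that finds overlapping hits."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead lets finditer report every start position,
    # including occurrences that overlap one another.
    return re.compile("(?=(" + body + "))")


def motif_start_offsets(seq, breakpoint, motif, flank=500):
    """Return motif start positions relative to the breakpoint.

    The window spans `flank` bases on each side of the breakpoint
    (flank=500 gives a 1-kb bin; this default is an assumption).
    Only motifs lying entirely inside the window are reported.
    """
    lo = max(0, breakpoint - flank)
    hi = min(len(seq), breakpoint + flank)
    window = seq[lo:hi].upper()
    pattern = motif_to_regex(motif)
    return [m.start() + lo - breakpoint for m in pattern.finditer(window)]
```

Pooling these offsets over all deletions and dividing by the number of deletions would give a per-position frequency profile of the kind plotted against distance from the breakpoint; the same function could be applied to the simulated breakpoints to obtain the control profile.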
When we considered the frequencies of individual motifs in the vicinity of the breakpoints, the distributions could be classified into four subtypes (Table S5): “Valleys”, “Peaks”, “M shapes” and “Others” (Figures S7-S12). In total, 22 motifs were grouped as “Valleys” (Figures S7 and S8); their frequencies decreased with decreasing distance to the breakpoints and reached their lowest values at the breakpoints themselves. A further 28 motifs were grouped as “Peaks” (Figures S9 and S10); their frequencies increased with decreasing distance to the breakpoints and reached their highest values precisely at the breakpoints. Fourteen motifs were grouped as “M shapes” (Figure S11), characterized by frequencies that followed an “M”-shaped curve. Finally, 11 motifs were grouped as “Others” (Figure S12), characterized by frequencies that were unrelated to distance from the breakpoints. Many of these patterns are exclusive to the pathogenic deletion dataset and hence may indicate functionally relevant sequence differences between the two datasets that predispose these regions to instability.
We counted the frequency of each motif in 10-bp bins centered at each breakpoint of the HGMD-deletion dataset and at the corresponding 100 simulated breakpoints. We then calculated “experience hit” (eH) values to assess the significance of each motif relative to the control breakpoints, and averaged the eH-value of each motif over all deletion breakpoints in the HGMD-deletion dataset. The eH-value is the number of simulated control breakpoints for which the motif count in the vicinity of the breakpoint was larger than the motif count in the vicinity of the pathogenic deletion breakpoint, divided by 100. We found that 23 motifs occurred more frequently (eH-value < 0.05) in 10-bp bins centered at the breakpoints of the pathogenic deletion dataset than at the breakpoints from the simulated dataset (Figure 6A). These motifs were “CTY”, “RNYNNCNNGYNGKTNYNY”, “GCCCWSSW”, “GCTGGTGG”, “GCWGGWGG”, “GGAGGTGGGCAGGARG”, “AGAGGTGGGCAGGTGG”, “GAAAATGAAGCTATTTACCCAGGA”, “TGRRKM”, “CAGR”, “GCS”, “WGGAG”, “CTGGCG”, “RGAC”, “RAG”, “ACYYMK”, “CCG”, “GTAAGT”, “CGGCGG”, “TTCTTC”, “CCACCA”, “GCCCCG” and “GGAGAA” (Table 2), which included four motifs identified by Ball et al. (2005). A one-sided Fisher’s exact test was used to examine whether the motifs identified by Ball et al. were overrepresented among the motifs that occurred more frequently in 10-bp bins centered at the breakpoints of the pathogenic deletion dataset than at the breakpoints from the simulated dataset. The result was not significant (OR = 0.35, P-value = 0.055). We also calculated the average frequencies of all 78 motifs in 1-kb bins centered at the deletion breakpoints to explore the relationship between motif frequency and deletion length (Figure 6B), and identified six motifs whose frequencies correlated significantly with deletion length (Pearson correlation coefficient, PCC > 0.7 and P < 1E-6) (Figure S13).
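The eH-value defined above amounts to an empirical one-sided P-value, and can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, ties between simulated and observed counts are not counted as exceedances (following the “larger than” wording), and the per-motif average is taken as a simple arithmetic mean over breakpoints.

```python
def eh_value(observed_count, simulated_counts):
    """Fraction of simulated breakpoints whose motif count exceeds the observed one.

    With 100 simulated breakpoints per pathogenic breakpoint, this equals the
    number of exceedances divided by 100. Ties are not counted (assumption).
    """
    exceed = sum(1 for count in simulated_counts if count > observed_count)
    return exceed / len(simulated_counts)


def mean_eh_value(observed_counts, simulated_counts_per_breakpoint):
    """Average the eH-value of one motif over all pathogenic breakpoints.

    observed_counts: motif count in the bin around each pathogenic breakpoint.
    simulated_counts_per_breakpoint: for each pathogenic breakpoint, the motif
    counts around its matched simulated breakpoints.
    """
    values = [
        eh_value(obs, sims)
        for obs, sims in zip(observed_counts, simulated_counts_per_breakpoint)
    ]
    return sum(values) / len(values)
```

For example, if a motif occurs three times around a pathogenic breakpoint and only 4 of the 100 matched simulated breakpoints carry more than three copies, the eH-value for that breakpoint is 0.04, which falls below the 0.05 threshold used above.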