Figures
Figure 1. Repeat length distribution in all 1-kb bins centered at the breakpoints of the HGMD-deletion dataset. “DR”, “GQ”, “IR”, “MR”, “STR”, and “Z” denote direct repeats, G-quadruplex-forming repeats, inverted repeats, mirror repeats, short tandem repeats, and Z-DNA, respectively.
Figure 2. Frequency of non-B DNA-forming repeats near the breakpoints of the HGMD-deletion dataset. The X-axis represents the position relative to the breakpoint and the Y-axis represents the repeat frequency. Panels A-F show the frequencies of direct repeats (DR), inverted repeats (IR), mirror repeats (MR), G-quadruplex-forming repeats (GQ), short tandem repeats (STR), and Z-DNA, respectively. Frequency here is the proportion of sequences that contain a repeat at each position.
Figure 3. Relationship between deletion length and average non-B DNA-forming repeat frequency. A. Relationship between deletion length and average repeat frequency within the 1-kb bins centered at breakpoints. B. Correlations observed between deletion length and the average repeat frequency for each 10-bp bin of deletion lengths. C. Significant correlations observed between deletion length and repeat frequency in the 1-kb sequences centered at breakpoints, at different cut-offs: deletions with length ≤9 bp, ≤27 bp, and ≤30 bp, respectively.
Figure 4. Repeat frequency near the breakpoints of deletions of different lengths. Panels A-D show the average frequencies of direct repeats (DR), G-quadruplex-forming repeats (GQ), short tandem repeats (STR), and inverted repeats (IR), respectively.
Figure 5. GC content in the vicinity of deletion breakpoints and the relationship between deletion length and GC content. A. GC content in the vicinity of all pathogenic deletion breakpoints and in the simulated data. B. Relationship between deletion length and GC content. For deletions shorter than 38 bp, deletion length was significantly correlated with GC content (PCC = 0.71 and P-value = 7.3E-7).
Figure 6. Sequence motifs around the breakpoints of deletions. A. eH-values for the difference between the frequencies of motif occurrence in 10-bp bins centered at breakpoints of the deletion data and of the simulated data. Sixteen motifs occurred more frequently (eH-value < 0.01) in 10-bp bins centered at the pathogenic deletion breakpoints than in 10-bp bins centered at the breakpoints of the control dataset, including simulated breakpoints. B. Relationship between deletion length and average motif frequency. Each point represents the average motif frequency in the vicinity of deletions of a given length.
Figure 7. Pearson correlation coefficients (PCC) and PR scores for motif frequency, GC content, or repeat frequency against deletion length. A. Distribution of PCC values, each representing the correlation between deletion length and motif frequency, GC content, or repeat frequency. B. Relationship between deletion length and PR score.
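The quantities plotted in these figures can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`position_frequency`, `gc_content`, `pearson`) are hypothetical, and it assumes that repeats are given as half-open `(start, end)` intervals in coordinates relative to the start of the window around each breakpoint. It shows the per-position repeat frequency as defined in Figure 2 (the proportion of sequences with a repeat at each position), the GC content of a sequence, and a plain Pearson correlation coefficient of the kind reported in Figures 5 and 7.

```python
from math import sqrt


def position_frequency(repeat_intervals, window=1000):
    """Proportion of sequences that carry a repeat at each window position.

    repeat_intervals: one list per breakpoint, each holding half-open
    (start, end) intervals relative to the window start (an assumption
    of this sketch, not a format taken from the paper).
    """
    counts = [0] * window
    for intervals in repeat_intervals:
        covered = set()
        for start, end in intervals:
            # Clip each interval to the window and collect covered positions,
            # so overlapping repeats in one sequence are counted only once.
            covered.update(range(max(start, 0), min(end, window)))
        for pos in covered:
            counts[pos] += 1
    n = len(repeat_intervals)
    return [c / n for c in counts] if n else [0.0] * window


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

For example, `position_frequency([[(0, 2)], [(1, 3)]], window=4)` describes two toy sequences in a 4-bp window; only position 1 is covered by a repeat in both. The significance test behind the reported P-values is not covered by this sketch.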