Relationship between GC content and deletion length
We next determined the GC content within the 1-kb bins centered at the breakpoints in the HGMD-deletion dataset and the control1 dataset. As shown in Figure 5A, the GC content of the pathogenic deletions reached its maximum at precisely 1 bp from the breakpoint. Further, the GC content was invariably higher for the pathogenic deletions than for the control1 dataset (Student's t-test, p < 2.2E-16). By contrast, the GC content distribution for control1 was remarkably constant irrespective of the position relative to the breakpoint and showed neither a peak nor a valley at the breakpoint. The average GC content was then determined for deletions of different lengths (Figure 5B). The correlation between deletion length and GC content was highest when the analysis was restricted to deletions of ≤29 bp (PCC = 0.87, p = 6.0E-10). The correlation remained significant (PCC = 0.71, p = 7.3E-7) for deletions of ≤38 bp, but was not significant for deletions of >38 bp. These results suggest that, in relation to GC content, 29-38 bp represents a potential cut-off that can serve to divide pathogenic deletions into microdeletions and gross deletions. When we used either 29 bp or 38 bp as the cut-off to partition the deletions into two groups, the GC content at the breakpoint was higher for the short deletions than for the long deletions (Figure S5). Thus, the short and long deletions defined by either cut-off differ in GC content at their breakpoints.
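The two computations described above (the GC content in windows around a breakpoint, and the Pearson correlation between deletion length and mean GC content up to a length cut-off) can be sketched as follows. This is a minimal illustration and not the pipeline used to produce Figure 5: the function names, the window size, and the input layout (a list of (deletion length, GC content) pairs) are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt


def gc_content(seq):
    """Fraction of G or C bases in seq (case-insensitive); 0.0 if seq is empty."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_profile(seq, breakpoint, flank=500, window=20):
    """GC content in consecutive windows spanning the breakpoint.

    Returns (offset, gc) pairs, where offset is the window start relative
    to the breakpoint. flank=500 gives a 1-kb bin; the window size is an
    assumed value, not one taken from the text.
    """
    start = max(0, breakpoint - flank)
    end = min(len(seq), breakpoint + flank)
    return [
        (pos - breakpoint, gc_content(seq[pos:min(pos + window, end)]))
        for pos in range(start, end, window)
    ]


def mean_gc_by_length(records):
    """Average GC content for each deletion length.

    records: iterable of (deletion_length_bp, gc_content) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for length, gc in records:
        grouped[length].append(gc)
    return {length: sum(v) / len(v) for length, v in grouped.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pcc_up_to(mean_gc, cutoff):
    """PCC between deletion length and mean GC content for lengths <= cutoff."""
    lengths = sorted(length for length in mean_gc if length <= cutoff)
    return pearson(lengths, [mean_gc[length] for length in lengths])
```

Under these assumptions, calling pcc_up_to(mean_gc_by_length(records), cutoff) for a range of cut-offs and comparing the resulting coefficients is one way to look for a length threshold such as the 29-38 bp range reported above. The sketch returns only the coefficient; the p-values quoted in the text would need a separate significance test.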