Relationship between GC content and deletion length
We next determined the GC content within the 1-kb bins centered at the
breakpoints in the HGMD-deletion dataset and the control1 dataset. As
shown in Figure 5A, the GC content was at its maximum at precisely 1 bp
from the breakpoint. Further, the GC content was invariably higher for
pathogenic deletions than for the control1 dataset (Student’s t-test,
p < 2.2E-16). In addition, the GC content distribution
for control1 was remarkably constant irrespective of the breakpoint
location and did not show a peak or valley at the breakpoint. The
average GC content was then determined for deletions of different
lengths. The results are shown in Figure 5B. The correlation between
deletion length and GC content was strongest for deletions of ≤29 bp
(PCC = 0.87, p = 6.0E-10). The correlation remained significant for
deletions of ≤38 bp (PCC = 0.71, p = 7.3E-7) but not for deletions of
>38 bp. These results suggest that, in relation to GC content, a length
of 29-38 bp represents a potential cut-off that can serve to divide
pathogenic deletions into gross deletions and microdeletions.
When we used either 29 bp or 38 bp as the cut-off to partition the
deletions into two groups, the GC content at the breakpoint was higher
for the short deletions than for the long deletions (Figure S5). Thus,
short and long deletions partitioned at either cut-off differ in GC
content at their breakpoints.
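
The two calculations described in this section, GC content in a window centered on a breakpoint and the correlation between deletion length and mean GC content up to a length cut-off, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. The function names, the 0-based breakpoint coordinate, the exclusion of non-ACGT bases, and the choice to correlate per-length mean values are all assumptions made for illustration. The sketch returns only the correlation coefficient and does not compute a p-value.

```python
from collections import defaultdict
from math import sqrt


def gc_content(seq):
    """Fraction of G/C among A/C/G/T bases (assumption: N and other symbols are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def window_around(chrom_seq, breakpoint, width=1000):
    """Window of `width` bases centered on a 0-based breakpoint, clipped at sequence ends."""
    half = width // 2
    start = max(0, breakpoint - half)
    end = min(len(chrom_seq), breakpoint + half)
    return chrom_seq[start:end]


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either variable has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def mean_gc_by_length(deletions):
    """Map each deletion length (bp) to the mean GC fraction of deletions of that length.

    `deletions` is an iterable of (length_bp, gc_fraction) pairs (hypothetical input format).
    """
    grouped = defaultdict(list)
    for length, gc in deletions:
        grouped[length].append(gc)
    return {length: sum(vals) / len(vals) for length, vals in grouped.items()}


def correlation_up_to(mean_gc, cutoff):
    """Pearson correlation between length and mean GC for lengths <= cutoff."""
    lengths = sorted(length for length in mean_gc if length <= cutoff)
    return pearson(lengths, [mean_gc[length] for length in lengths])
```

Under these assumptions, calling `correlation_up_to(mean_gc, c)` for each candidate cut-off `c` and comparing the resulting coefficients is one way to look for the length range in which the correlation is strongest.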