Can deletions be scored so as to separate gross deletions from microdeletions naturally?
For each deletion, we calculated the non-B DNA-forming repeat frequency, GC content, and motif frequency in the region around it. Subsequently, we obtained the percentile ranking of each deletion in the HGMD-repeat database according to the cumulative non-B DNA-forming repeat frequency, GC content, and motif frequency. Each deletion was then scored by summing its percentile rankings for these three features in the HGMD-deletion database; this sum was termed the percentile ranking (PR) score. We then investigated the correlation between the PR scores of deletions and their lengths. As shown in Figure 7B, for deletions shorter than 46 bp, the average PR score at each deletion length was significantly correlated with deletion length (PCC = 0.71, P-value = 4.1E-8). For deletions longer than 46 bp, no significant correlation was observed between the average PR score at each length and the deletion length. When the three features were examined separately, deletion length was significantly correlated with the PR score based on non-B DNA-forming repeat frequency for deletions <31 bp (P-value = 8.8E-9), and with the PR score based on GC content for deletions <47 bp (P-value = 5.0E-8) (Figure S14). These findings suggest that a deletion length of around 30-47 bp could serve as a natural cutoff to partition microdeletions from gross deletions on the basis of PR scores calculated from the non-B DNA-forming repeat frequency, GC content, and motif frequency.
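The scoring and correlation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the pipeline used in the study: the percentile rank is assumed to be the percentage of deletions with a feature value less than or equal to the one being ranked, the three features are assumed to be supplied as parallel lists (one entry per deletion), and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from math import sqrt


def percentile_ranks(values):
    """Percentile rank of each value (assumed definition: % of values <= it)."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


def pr_scores(repeat_freq, gc_content, motif_freq):
    """PR score per deletion: sum of its three per-feature percentile ranks."""
    ranks = [percentile_ranks(f) for f in (repeat_freq, gc_content, motif_freq)]
    return [sum(triple) for triple in zip(*ranks)]


def mean_score_by_length(lengths, scores):
    """Average PR score over all deletions that share the same length (bp)."""
    buckets = defaultdict(list)
    for length, score in zip(lengths, scores):
        buckets[length].append(score)
    return {length: sum(s) / len(s) for length, s in sorted(buckets.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_below_cutoff(lengths, scores, cutoff):
    """PCC between deletion length and mean PR score for lengths < cutoff."""
    means = mean_score_by_length(lengths, scores)
    kept = [(length, mean) for length, mean in means.items() if length < cutoff]
    xs, ys = zip(*kept)
    return pearson(xs, ys)
```

Calling `correlation_below_cutoff(lengths, scores, cutoff)` for a range of candidate `cutoff` values is one way to look for the length at which the correlation weakens. The sketch returns only the coefficient; a significance test such as `scipy.stats.pearsonr` would be needed to obtain the accompanying P-value.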