Microhomology analysis for deletions and control1 data
To identify microhomologies, we used MHcut, which searches for homologous sequences within the flanking sequences of deletion variants. Of the 15,453 deletions of at least 3 bp, 40% (6,195) were flanked by microhomologies of at least 3 bp, significantly higher than the corresponding probability (7.3%\(\pm\)0.2%) in control1 (t-test P-value <2.2E-6). Among the remaining deletions, 59.4% of 1 bp deletions were flanked by microhomologies of at least 1 bp (control1: 28.2%\(\pm\)0.2%), and 71.3% of 2 bp deletions were flanked by microhomologies of at least 2 bp (control1: 8.7%\(\pm\)0.1%), implicating microhomology as a common, enriched feature of pathogenic deletion breakpoints. When we divided the pathogenic deletions in the HGMD dataset into two groups using a 30 bp length cutoff, 42% of deletions shorter than 30 bp were flanked by microhomologies, compared with 29% of the longer deletions. A Chi-square test indicated that short deletions (length <30 bp) were significantly enriched for microhomologies compared with longer deletions (P-value < 2.2E-16). However, there was no significant correlation between the frequency of microhomologies and deletion length.
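To make the notion of a flanking microhomology concrete, the following minimal Python sketch scores a single deletion. It is an illustration only and does not reproduce MHcut's implementation. The function names, the exact-match criterion, and the two breakpoint configurations compared are assumptions made for this example. The default threshold follows the text: at least 1 bp for 1 bp deletions, 2 bp for 2 bp deletions, and 3 bp for deletions of 3 bp or more.

```python
def microhomology_length(left_flank: str, deleted: str, right_flank: str) -> int:
    """Longest exact microhomology (in bp) at the breakpoints of one deletion.

    Hypothetical helper for illustration; not MHcut's algorithm.
    """
    left_flank, deleted, right_flank = (
        s.upper() for s in (left_flank, deleted, right_flank)
    )

    # Configuration 1: first bases of the deleted segment vs. first bases
    # of the right flank, read left to right.
    forward = 0
    for a, b in zip(deleted, right_flank):
        if a != b:
            break
        forward += 1

    # Configuration 2: last bases of the deleted segment vs. last bases
    # of the left flank, read right to left.
    backward = 0
    for a, b in zip(reversed(deleted), reversed(left_flank)):
        if a != b:
            break
        backward += 1

    return max(forward, backward)


def is_flanked_by_microhomology(left_flank, deleted, right_flank, min_len=None):
    """True if the deletion carries a microhomology of at least min_len bp.

    By default (an assumption mirroring the thresholds in the text),
    min_len is the deletion length capped at 3 bp.
    """
    if min_len is None:
        min_len = min(len(deleted), 3)
    return microhomology_length(left_flank, deleted, right_flank) >= min_len
```

For example, with left flank ACGTTGCA, deleted segment GCATTT and right flank GCAAAC, the first three bases of the deleted segment (GCA) recur at the start of the right flank, so the deletion would be counted as flanked by a 3 bp microhomology under these assumptions.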