Non-B DNA-forming repeats and deletion breakpoints
A major goal of this work was to ascertain whether gene deletions causing human inherited disease occur disproportionately at sites that are capable of adopting non-B DNA structures. These include hairpins and looped-out bases (direct repeats (DR) and short tandem repeats (STR)), cruciforms (inverted repeats (IR)), mirror repeats (MR), G4 DNA (G-quartets (GQ)), and left-handed Z-DNA (Z). Using criteria defined in previous studies (Cer et al., 2011; Cer et al., 2013) and in Table S1, we searched for uninterrupted versions of each type of repeat within a 1-kb window centered at each deletion breakpoint. Most of the identified repeat sequences were less than 50 bp in length, and IR and STR were more abundant in the deletion-flanking sequences than the other types of repeat (Figure 1).
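The window-based search described above can be illustrated with a minimal sketch. It assumes that repeat annotations are already available as half-open (start, end, type) intervals on the same coordinate system as the breakpoint, and that a repeat is counted when it overlaps the window at all. The function name, the data layout, and the overlap rule are illustrative assumptions, not the procedure used in the study.

```python
from collections import Counter

def count_repeats_in_window(breakpoint, repeats, window=1000):
    """Count annotated repeats overlapping a window centered at a breakpoint.

    `repeats` is an iterable of (start, end, kind) tuples with half-open
    coordinates [start, end). The overlap criterion is an assumption made
    for illustration only.
    """
    half = window // 2
    lo, hi = breakpoint - half, breakpoint + half  # half-open window [lo, hi)
    counts = Counter()
    for start, end, kind in repeats:
        # Two half-open intervals overlap when each starts before the other ends.
        if start < hi and end > lo:
            counts[kind] += 1
    return counts

# Hypothetical annotations around a breakpoint at position 500.
example_repeats = [
    (100, 120, "IR"),
    (480, 495, "STR"),
    (990, 1010, "IR"),   # straddles the right edge of the window
    (1500, 1520, "DR"),  # outside the window
]
example_counts = count_repeats_in_window(500, example_repeats)
```

Summing such per-breakpoint counts over all deletions in a dataset would give per-type totals of the kind compared between the pathogenic and simulated deletions.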
We compared the total numbers of repeats within 1-kb bins centered at the breakpoints for the HGMD-deletion data and the simulated deletion dataset. All repeat types occurred at a higher frequency in the vicinity of the gross deletions (length >20 bp) than in the control1 dataset (Table 1). However, when we combined the gross deletions and microdeletions, the numbers of repeats in the individual DR, IR, MR, STR, and Z-DNA categories around the pathogenic deletion breakpoints were lower than those around the simulated data (Table 1, Figure 2). Table S2 provides a detailed comparison of the frequencies of the different types of non-B DNA-forming repeats in the vicinity of breakpoints of deletions of different lengths. The frequency of GQ was higher around the pathogenic deletion breakpoints than around the simulated data when the GQ lay about 150 bp from the breakpoint, but lower when the GQ lay close to the breakpoint (Figure 2D). We also partitioned the GQs around the deletion breakpoints into G-rich GQs (15,931/32,067, 49.68%) and C-rich GQs (16,136/32,067, 50.32%), and compared their frequencies around pathogenic deletion breakpoints with those in the simulated data (control1). The frequencies of C-rich and G-rich GQs around the breakpoints of pathogenic deletions were similar to each other and generally higher than those around the simulated deletion breakpoints of control1 (Figure S1A and B).
To ascertain whether we could identify a cut-off that would help to functionally distinguish gross deletions from microdeletions based on the occurrence of non-B DNA-forming motifs, we determined the average frequency of all types of non-B DNA-forming repeats in the 1-kb bins centered at the deletion breakpoints. As shown in Figure 3A, the average frequency of non-B DNA-forming repeats around the deletion breakpoints increased with the length of the pathogenic deletions. When the deletion length was ≤8 bp, the frequency of non-B DNA-forming repeats in the vicinity of deletion breakpoints was lower than random expectation. Here, only the 40,037 deletions shorter than 106 bp were analyzed, because beyond this length fewer than 4 deletions were available for each length and these longer deletions account for only 4.9% of the total. When we used a 10 bp sliding window to separate the deletions into bins and computed the average frequency of non-B DNA-forming repeats around the breakpoints of the deletions in each bin, the correlation between deletion length and repeat frequency was positive but not statistically significant (Pearson Correlation Coefficient (PCC)=0.33, p=0.32) (Figure 3B).
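As an illustration of the calculation described above, the following sketch averages the repeat count per deletion length and computes a Pearson correlation coefficient between length and average frequency. The record layout (one (deletion length, repeat count) pair per deletion) and both function names are hypothetical, and the p-value calculation is omitted.

```python
import math
from collections import defaultdict

def mean_frequency_by_length(records):
    """Average repeat count per deletion length.

    `records` is an iterable of (deletion_length, repeat_count) pairs,
    one per deletion (a hypothetical layout chosen for illustration).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for length, count in records:
        sums[length][0] += count
        sums[length][1] += 1
    return {length: total / n for length, (total, n) in sorted(sums.items())}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

# Toy data, not taken from the study.
toy_means = mean_frequency_by_length([(1, 2), (1, 4), (2, 6), (3, 9)])
toy_r = pearson(list(toy_means), list(toy_means.values()))
```

A significance test for the coefficient would normally be added on top of this, for example through a statistics library.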
We then tested the correlation between deletion length and the frequency of non-B DNA-forming repeats for successively larger upper limits on deletion length. When the deletion length was ≤9 bp, the PCC between deletion length and average non-B DNA-forming repeat frequency was 0.79 (P-value = 1.10E-2). When the deletion length was ≤27 bp, the PCC attained its maximal value of 0.91 (P-value = 3.39E-11), whereas when the deletion length was ≤30 bp, the PCC was 0.80 (P-value = 9.06E-8) (Figure 3C). These findings indicate that the non-B DNA-forming repeat frequency in the vicinity of the breakpoints of deletions ≤27 bp in length was significantly and positively correlated with deletion length. When the deletion length was >30 bp, no significant correlation was observed between deletion length and the average non-B DNA-forming repeat frequency. We therefore speculate that 30 bp could represent a natural cut-off that separates the pathogenic deletions into two relatively distinct (albeit overlapping) groups, with the larger deletions (>30 bp) having more complicated mechanisms of formation than the shorter deletions.
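The cut-off search described above can be sketched as a scan over upper length limits: for each limit, the correlation is computed over all deletion lengths up to that limit, and the limit with the highest coefficient is reported. This is a minimal sketch under assumed inputs (a mapping from deletion length to average repeat frequency). The function names and the `min_points` parameter are illustrative and not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def best_cutoff(mean_freq, min_points=3):
    """Return (cutoff, r) for the upper length limit maximizing the PCC.

    `mean_freq` maps deletion length -> average repeat frequency.
    `min_points` (an assumed parameter) is the smallest number of
    lengths over which a correlation is computed. Returns None when
    fewer than `min_points` lengths are available.
    """
    lengths = sorted(mean_freq)
    best = None
    for i in range(min_points, len(lengths) + 1):
        xs = lengths[:i]
        ys = [mean_freq[x] for x in xs]
        r = pearson(xs, ys)
        if best is None or r > best[1]:
            best = (xs[-1], r)
    return best

# Toy profile (not study data): frequency rises up to length 5, then falls.
toy_profile = {1: 1.0, 2: 2.1, 3: 2.9, 4: 4.0, 5: 5.0, 6: 3.0, 7: 2.0, 8: 1.0}
toy_best = best_cutoff(toy_profile)
```

Because correlations over very few points are unstable, the choice of `min_points` and the accompanying significance values matter when interpreting the maximum.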
The relationship between the frequencies of the different types of non-B DNA-forming repeats and the deletion length is shown in Figure S2. For G-quadruplex-forming (GQ) sequences, a strong correlation (PCC=0.87, p=3.48E-10) was observed between deletion length and repeat frequency when the deletion length was ≤30 bp. For IR, DR, and STR, strong correlations (PCC=0.72 and p=1.3E-2, PCC=0.76 and p=5E-6, and PCC=0.73 and p=1.57E-5, respectively) were observed when the deletion length was ≤11 bp, ≤27 bp, and ≤27 bp, respectively. By contrast, no strong correlation was observed between deletion length and the average frequencies of MR and Z-DNA-forming repeats. Taken together, the frequencies of DR, GQ, and STR were significantly correlated with deletion length when the deletions were ≤30 bp, whereas the frequencies of IR were significantly correlated with deletion length when the deletions were ≤10 bp. These results suggest that a more precise cut-off to separate deletions mechanistically into microdeletions and gross deletions might lie between 10 bp and 30 bp.
To further investigate the non-B DNA-forming repeat frequency and distribution in the vicinity of breakpoints of deletions of different lengths, we used 30 bp as a cut-off to divide the pathogenic deletions in the HGMD-deletion dataset into gross deletions and microdeletions and analyzed the frequency of DR, GQ, and STR repeats in the vicinity of the breakpoints. We observed two frequency peaks of DR and STR repeats for deletions >30 bp and two frequency valleys for deletions ≤30 bp (Figure 4A and C). By contrast, no obvious frequency peak or valley was observed for GQ repeats flanking deletions >30 bp, whereas a valley was found around the breakpoint location of deletions ≤30 bp (Figure 4B). When we divided the GQ repeats into G-rich and C-rich, the frequencies of G-rich and C-rich GQ repeats around the breakpoints of short and long pathogenic deletions were similar, and both showed valleys around the breakpoints of deletions ≤30 bp (Figure S1C). The absence of any obvious frequency peak of GQ repeats for deletions >30 bp may arise because G4 structures formed by GQ repeats can cause DNA polymerase pausing when associated with certain short motifs, which in turn promotes short deletions. Indeed, when we analyzed the probability of co-occurrence of GQ around deletions with short motifs found at DNA polymerase pause sites (Supplementary Table S3), 91.1% of the GQs co-occurred with such short motifs.
We also used 10 bp as a cut-off to divide the deletions into microdeletions and gross deletions and to analyze the frequency of IR in the vicinity of breakpoints. The frequencies of IR repeats showed a peak around the breakpoint of deletions with length >10 bp, and a valley at the breakpoint of deletions with length ≤10 bp (Figure 4D). These results suggest that the deletions separated by a cut-off into two groups had different properties in terms of the frequencies of non-B DNA-forming repeats in the vicinity of breakpoints. The patterns observed for the frequencies of non-B DNA-forming repeats in the vicinity of deletion breakpoints contrasted with the flat lines seen in controls (Figure 4), supporting the conclusion that a 30 or 10 bp cut-off can functionally distinguish microdeletions from gross deletions.
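The split-and-profile comparison described above can be sketched as follows: deletions are divided by a length cut-off, and the positions of repeats relative to the breakpoint are binned so that peaks or valleys around position 0 become visible. The input layout (deletion length plus a list of repeat offsets in bp relative to the breakpoint), the group labels, and the bin size are assumptions made for illustration.

```python
def positional_profile(deletions, cutoff=30, window=1000, bin_size=100):
    """Average number of repeats per positional bin, split by deletion length.

    `deletions` is an iterable of (deletion_length, offsets) pairs, where
    `offsets` lists repeat positions relative to the breakpoint (bp).
    Returns a dict with one profile per group: "short" for lengths
    <= cutoff and "long" for lengths > cutoff. A group with no deletions
    keeps an all-zero profile.
    """
    half = window // 2
    n_bins = window // bin_size
    profiles = {"short": [0.0] * n_bins, "long": [0.0] * n_bins}
    totals = {"short": 0, "long": 0}
    for length, offsets in deletions:
        group = "short" if length <= cutoff else "long"
        totals[group] += 1
        for off in offsets:
            if -half <= off < half:  # keep only repeats inside the window
                profiles[group][(off + half) // bin_size] += 1
    # Normalize by the number of deletions in each group.
    for group, n in totals.items():
        if n:
            profiles[group] = [c / n for c in profiles[group]]
    return profiles

# Toy input, not study data: short deletions with repeats in the flanks,
# long deletions with repeats near the breakpoint.
toy_deletions = [
    (5, [-450, 450]),
    (12, [-420, 480, 0]),
    (200, [0, 10]),
    (45, [-5]),
]
toy_profiles = positional_profile(toy_deletions, cutoff=30)
```

Calling the same function with `cutoff=10` would give the corresponding split for the IR analysis. A flat profile from a control dataset processed in the same way would serve as the baseline.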
In summary, the frequency and distribution of non-B DNA-forming repeats in the vicinity of pathogenic deletion breakpoints were clearly different between deletions ≤30 bp and those >30 bp (Figure 4). These differences may reflect heterogeneity in the underlying causative mechanisms responsible for the two groups of deletions. For the breakpoints of deletions ≤30 bp, the number of non-B DNA-forming repeats increased in the breakpoint flanking regions in a “mirror image” fashion, suggesting either that these breakpoints are rarely located within non-B DNA-forming sequences or that limited resection occurs before repair. Nevertheless, the increase in the frequency of these repeats in the breakpoint flanking regions supports the view that non-B DNA structures induced nearby DNA breakage or polymerase stalling. Indeed, a comparable pattern of non-B DNA-forming sequences was not observed in the control dataset or in pathogenic deletions >30 bp. Rather, the most striking difference between the ≤30 bp and >30 bp deletions was observed in the distribution of direct repeats, which exhibited the highest frequency directly at breakpoints, suggesting replication slippage as the initiating event for the genetic alteration.