Non-B DNA-forming repeats and deletion breakpoints
A major goal of this work was to ascertain whether gene deletions
causing human inherited disease occur disproportionately at sites that
are capable of adopting non-B DNA structures, including hairpins and
looped-out bases (direct repeats (DR) and short tandem repeats (STR)),
cruciforms (inverted repeats (IR)), mirror repeats (MR), G4 DNA
(G-quartets (GQ)), and left-handed Z-DNA (Z-DNA (Z)). Using criteria
defined in previous studies (Cer et al., 2011; Cer et al., 2013) and in
Table S1, we searched
for uninterrupted versions of each type of repeat within a 1-kb window
centered at each deletion breakpoint. We found that most of the
identified repeat sequences were less than 50 bp in length (Figure 1).
As shown in Figure 1, IR and STR were found more often in the
deletion-flanking sequences than the other types of repeats.
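The window-based search described above can be illustrated with a minimal sketch. It assumes that repeats have already been annotated as half-open coordinate intervals and that a repeat is counted when any part of it overlaps the 1-kb window; both the function name and the overlap rule are illustrative assumptions, not the criteria of Table S1.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def count_repeats_near_breakpoint(
    repeats: List[Interval], breakpoint: int, window: int = 1000
) -> int:
    """Count repeats overlapping the window centered at `breakpoint`.

    Assumption: any overlap with the window counts; a stricter rule
    (e.g. full containment) would only change the comparison below.
    """
    half = window // 2
    lo, hi = breakpoint - half, breakpoint + half
    return sum(1 for start, end in repeats if start < hi and end > lo)


# Toy data: hypothetical repeat coordinates on a single chromosome.
toy_repeats = [(100, 120), (480, 510), (990, 1010), (1600, 1650)]
n_near = count_repeats_near_breakpoint(toy_repeats, breakpoint=1000)
print(n_near)
```

Applying the same function to the simulated breakpoints would give the control counts against which the observed counts are compared.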
We compared the total numbers of repeats within 1-kb bins centered at
the breakpoints for the HGMD-deletion data and the simulated deletion
dataset. All repeats occurred with a higher frequency in the vicinity of
the gross deletions (length >20 bp) than in the control1
dataset (Table 1). However, when we combined the gross deletions and
microdeletions, we found that the numbers of repeats in the individual
DR, IR, MR, STR, and Z-DNA categories around the pathogenic deletion
breakpoints were lower than those around the simulated data (Table 1,
Figure 2). Table S2 shows the detailed comparison of frequencies of
different types of non-B DNA-forming repeats in the vicinity of
breakpoints of deletions of different lengths. The frequencies of GQ
around the pathogenic deletion breakpoints were higher than around the
simulated data when the GQ was about 150 bp away from the deletion
breakpoints (Figure 2D). However, when the GQ was close to the deletion
breakpoints, the frequency of this repeat around the pathogenic deletion
breakpoints was lower than around the simulated data (Figure 2D). We
also partitioned the GQs around the breakpoints of deletions into G-rich
GQs (15,931/32,067, 49.68%) and C-rich GQs (16,136/32,067, 50.32%),
and compared their frequencies around pathogenic deletion breakpoints
with the simulated data, control1. We found that the frequencies of
C- and G-rich GQs around breakpoints of pathogenic deletions were rather
similar to each other and generally higher than those around the
simulated deletion breakpoints of control1 (Figure S1A and B).
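The distance-dependent comparisons in this paragraph (Figure 2) amount to binning repeat positions by their offset from the breakpoint and comparing observed with simulated counts bin by bin. The following is a minimal sketch under stated assumptions: the 50-bp bin size, the use of repeat midpoints, and the function names are all hypothetical, and the two datasets are assumed to be of equal size, so no normalization is applied.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def distance_profile(
    pairs: Iterable[Tuple[int, int]], bin_size: int = 50, half_window: int = 500
) -> Dict[int, int]:
    """Histogram of (repeat midpoint - breakpoint) offsets.

    `pairs` holds (breakpoint, repeat_midpoint) tuples. Keys of the result
    are the left edge of each bin, in bp relative to the breakpoint.
    """
    counts: Counter = Counter()
    for breakpoint, midpoint in pairs:
        offset = midpoint - breakpoint
        if -half_window <= offset < half_window:
            # Floor division keeps negative offsets in the correct bin.
            counts[(offset // bin_size) * bin_size] += 1
    return dict(counts)


def enrichment(observed: Dict[int, int], simulated: Dict[int, int]) -> Dict[int, float]:
    """Observed/simulated ratio per bin; bins empty in the simulation are skipped."""
    return {b: observed[b] / simulated[b] for b in observed if simulated.get(b)}


# Toy (breakpoint, repeat midpoint) pairs; all coordinates are made up.
observed_pairs = [(1000, 1160), (1000, 1150), (1000, 1010), (2000, 1840)]
simulated_pairs = [(1000, 1160), (1000, 1010), (1000, 1020), (2000, 1840)]

obs_profile = distance_profile(observed_pairs)
sim_profile = distance_profile(simulated_pairs)
ratio = enrichment(obs_profile, sim_profile)
for bin_start in sorted(ratio):
    print(bin_start, ratio[bin_start])
```

A ratio above 1 in a bin corresponds to a higher frequency around the pathogenic breakpoints than around the simulated ones at that distance, and a ratio below 1 to a lower frequency.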
To ascertain whether we could identify a cut-off that would help to
functionally distinguish gross deletions from microdeletions based on
the occurrence of non-B DNA-forming motifs, we determined the average
frequency of all types of non-B DNA-forming repeat in the 1-kb bins
centered at the deletion breakpoints. As shown in Figure 3A, as the
length of the pathogenic deletions increased, so too did the average
frequency of non-B DNA-forming repeats around the deletion breakpoints.
When the deletion length was ≤8 bp, the frequency of occurrence
of non-B DNA-forming repeats in the vicinity of deletion breakpoints was
lower than random expectation. Here, only the 40,037 deletions shorter
than 106 bp were analyzed, because beyond this length fewer than 4
deletions were available for each length and these longer deletions
made up only 4.9% of the total. When we used a 10 bp sliding window to
separate the deletions into bins and computed the average frequency of
non-B DNA-forming repeats around the breakpoints of the deletions in
each bin, deletion length was positively correlated with the frequency
of non-B DNA-forming repeats, but the correlation was not significant
(Pearson correlation coefficient (PCC)=0.33, p=0.32) (Figure 3B).
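The binning-and-correlation step can be sketched as follows. This is an illustration only: it assumes non-overlapping 10-bp length bins (the text does not state whether the windows overlap), uses each bin's first length as its x-value, and computes only the correlation coefficient. A p-value would additionally require a t-distribution, for example via `scipy.stats.pearsonr`.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_count_per_length_bin(
    records: Iterable[Tuple[int, int]], bin_width: int = 10
) -> Dict[int, float]:
    """Group (deletion_length, repeat_count) records into length bins.

    Bins cover lengths 1-10, 11-20, ... and are keyed by their first length.
    """
    grouped = defaultdict(list)
    for length, count in records:
        grouped[((length - 1) // bin_width) * bin_width + 1].append(count)
    return {start: sum(v) / len(v) for start, v in sorted(grouped.items())}


# Toy records: (deletion length in bp, repeats near its breakpoints).
toy_records = [(3, 2), (8, 4), (12, 5), (19, 7), (25, 8), (30, 10)]
binned = mean_count_per_length_bin(toy_records)
r = pearson(list(binned.keys()), list(binned.values()))
print(binned, r)
```

The toy data are constructed to be perfectly linear, so they say nothing about the strength of the relationship in the real dataset.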
We then tested the correlation between deletion length and the frequency
of non-B DNA-forming repeats. When the deletion length was ≤9 bp, the
PCC of deletion length and average non-B DNA-forming repeat frequency
was 0.79 (P-value = 1.10E-2). When the deletion length was ≤27 bp, the
PCC attained its maximal value, 0.91 (P-value = 3.39E-11), whereas when
the deletion length was ≤30 bp, the PCC was 0.80 (P-value = 9.06E-8)
(Figure 3C). These findings indicate that the non-B DNA-forming repeat
frequency in the vicinity of the breakpoints of deletions ≤27 bp in
length was significantly and positively correlated with deletion length.
When the deletion length was >30 bp, no significant
correlation was observed between deletion length and the average non-B
DNA-forming repeat frequency. Thus, we speculate that 30 bp could
represent a natural cut-off that serves to separate the pathogenic
deletions into two relatively distinct (albeit overlapping) groups, with
the larger deletions (with length >30 bp) having more
complicated mechanisms of formation than the shorter deletions.
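The cut-off scan described in this paragraph can be sketched as a cumulative correlation: for each candidate maximum length, the PCC is computed over all deletion lengths up to and including that length. The function names, the minimum of three points per correlation, and the toy values below are assumptions made for illustration; they do not reproduce the reported coefficients.

```python
from math import nan, sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else nan


def scan_cutoffs(mean_freq_by_length: Dict[int, float], min_points: int = 3) -> Dict[int, float]:
    """Map each candidate cut-off L to the PCC over all lengths <= L."""
    lengths = sorted(mean_freq_by_length)
    result = {}
    for i in range(min_points, len(lengths) + 1):
        xs = lengths[:i]
        ys = [mean_freq_by_length[length] for length in xs]
        result[xs[-1]] = pearson(xs, ys)
    return result


# Toy data: mean repeat frequency rises up to length 5, then fluctuates.
toy_freq = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 3.0, 7: 4.0, 8: 2.0}
pcc_by_cutoff = scan_cutoffs(toy_freq)
for cutoff, pcc in pcc_by_cutoff.items():
    print(cutoff, round(pcc, 2))
```

In such a scan, the cut-off at which the coefficient peaks and then declines marks the length beyond which the linear relationship weakens, which is the reasoning behind the 27 to 30 bp range discussed above.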
The relationship between the frequencies of the different types of non-B
DNA-forming repeats and the deletion length is shown in Figure S2. For
G-quadruplex-forming (GQ) sequences, a strong correlation (PCC=0.87,
p=3.48E-10) was observed between deletion length and repeat frequency
when the deletion length was ≤30 bp. For IR, DR, and STR, strong
correlations (PCC=0.72 and p=1.3E-2, PCC=0.76 and p=5E-6, and PCC=0.73
and p=1.57E-5, respectively) were observed when the deletion length was
≤11 bp, ≤27 bp, and ≤27 bp, respectively. However,
no strong correlation was observed between deletion length and the
average frequencies of MR and Z-DNA-forming repeats. Taken together, for
DR, GQ, and STR the frequencies of these repeats were significantly
correlated with deletion length when the deletions were ≤30 bp;
for IR, the repeat frequencies were significantly correlated with
deletion length when the deletions were ≤10 bp. These results
suggest that a more precise cut-off to separate deletions
mechanistically into microdeletions and gross deletions might lie
between 10 bp and 30 bp.
To further investigate the non-B DNA-forming repeat frequency and
distribution in the vicinity of breakpoints of deletions of different
lengths, we used 30 bp as a cut-off to divide the pathogenic deletions
in the HGMD-deletion dataset into gross deletions and microdeletions and
analyzed the frequency of DR, GQ, and STR repeats in the vicinity of the
breakpoints. We observed two frequency peaks of DR and STR repeats for
deletions >30 bp and two frequency valleys for deletions
≤30 bp (Figure 4A and C). However, no obvious frequency peak or valley
was observed for GQ repeats flanking deletions >30 bp
whereas a valley was found around the breakpoint location of deletions
≤30 bp (Figure 4B). When we divided the GQ repeats into G-rich and
C-rich subsets, we found that their frequencies around the breakpoints
of short and long pathogenic deletions were similar, and that both
showed valleys around the breakpoints of deletions ≤30 bp (Figure S1C).
The absence of any obvious frequency peak of GQ repeats for deletions
>30 bp may be because G4 structures arising from GQ repeats can cause
DNA polymerase pausing when associated with certain short motifs, which
in turn promotes short deletions.
Indeed, when we analyzed the probability of co-occurrence of GQ around
deletions with short motifs found at DNA polymerase pause sites
(Supplementary Table S3), 91.1% of the GQs co-occurred with such short
motifs.
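The co-occurrence analysis can be illustrated with a small sketch that reports the fraction of GQ-flanking sequences containing at least one pause-site motif. The motifs below are placeholders chosen for the example and are not taken from Supplementary Table S3. The sketch also searches the given strand only, which is a simplifying assumption.

```python
from typing import Iterable, Sequence

# Placeholder motifs for illustration only; NOT the motifs of Table S3.
PLACEHOLDER_MOTIFS = ("TGGAG", "ACCCC")


def fraction_with_motif(flanks: Sequence[str], motifs: Iterable[str]) -> float:
    """Fraction of sequences containing at least one motif (case-insensitive)."""
    motif_list = [m.upper() for m in motifs]
    if not flanks:
        return 0.0
    hits = sum(1 for seq in flanks if any(m in seq.upper() for m in motif_list))
    return hits / len(flanks)


# Toy GQ-flanking sequences (made up).
toy_flanks = ["GGGTTGGAGGGG", "ccaccccttt", "ATATATAT", "GGGAGGGTGGGAGGG"]
co_occurrence = fraction_with_motif(toy_flanks, PLACEHOLDER_MOTIFS)
print(co_occurrence)
```

A complete analysis would presumably also scan the reverse complement and compare the observed fraction with that expected from the background motif density.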
We also used 10 bp as a cut-off to divide the deletions into
microdeletions and gross deletions, and analyzed the frequency of IR in
the vicinity of the breakpoints. The
frequencies of IR repeats showed a peak around the breakpoint of
deletions with length >10 bp, and a valley at the
breakpoint of deletions with length ≤10 bp (Figure 4D). These results
suggest that the deletions separated by a cut-off into two groups had
different properties in terms of the frequencies of non-B DNA-forming
repeats in the vicinity of breakpoints. The patterns observed for the
frequencies of non-B DNA-forming repeats in the vicinity of deletion
breakpoints contrasted with the flat lines seen in controls (Figure 4),
supporting the conclusion that a 30 or 10 bp cut-off can functionally
distinguish microdeletions from gross deletions.
In summary, the frequency and distribution of non-B DNA-forming repeats
in the vicinity of pathogenic deletion breakpoints were clearly
different between deletions ≤30 bp and >30 bp (Figure 4). These
differences may reflect heterogeneity in the causative mechanisms
underlying the two groups of deletions. For the
breakpoints of deletions ≤30 bp, the number of non-B DNA-forming repeats
increased in the breakpoint flanking regions in a “mirror image”
fashion, suggesting that these breakpoints are either rarely located
within non-B DNA-forming sequences or that limited resection occurs
before repair. Nevertheless, the increase in the frequency of these
repeats at breakpoint flanking regions supports the view that non-B DNA
structures induced nearby DNA breakage or polymerase stalling. Indeed, a
comparable pattern of non-B DNA-forming sequences was not observed in
the control dataset or in pathogenic deletions >30 bp. Rather, the most
striking difference between the ≤30 bp and >30 bp deletions was
observed in the distribution of
direct repeats, which exhibited the highest frequency directly at
breakpoints, suggesting replication slippage as the initiating event for
the genetic alteration.