Discussion
Irrespective of whether we consider microdeletions or gross deletions,
the mechanisms underlying pathogenic deletions appear to be strongly
influenced by the local DNA sequence environment (Kondrashov & Rogozin,
2004; Krawczak & Cooper, 1991). The role of non-B DNA structures in the
formation of cancer-associated deletions, as well as deletions in the
germline and in mitochondrial sequences, has been appreciated for some
time (Bacolla, Tainer, Vasquez, & Cooper, 2016; Bacolla, Ye, Ahmed, &
Tainer, 2019; Damas et al., 2014; Dong et al., 2014; Fontana & Gahlon,
2020; Pabis, 2021; Svetec Miklenic & Svetec, 2021; Zhao et al., 2010).
Such non-B DNA structures often have key regulatory functions in DNA
replication and transcription but may also cause genomic instability
(Lemmens, van Schendel, & Tijsterman, 2015; Zhao et al., 2010).
Furthermore, many deletions in the human genome are mediated by
retrotransposon repeat-dependent mechanisms (Fujimoto et al., 2021;
Mendez-Dorantes, Tsai, Jahanshir, Lopezcolorado, & Stark, 2020; Morales
et al., 2021; Vocke et al., 2021). Similarly, many studies have
indicated a role for GC content and DNA motif sequences in the formation
of microdeletions and gross deletions (Cooper, Ball, & Mort, 2010;
Visser, Shimokawa, Harada, Niikawa, & Matsumoto, 2005). However, the
role of these sequence features in the formation of deletions of
different lengths has not yet been methodically examined by robust
statistical analyses. Meanwhile, the somewhat arbitrary definitions
traditionally employed to distinguish between microdeletions and gross
deletions have become blurred. We therefore collected 42,098
pathogenic deletions that display a length continuum stretching from 1
to 28,394,429 bp, from which we used 40,037 deletions with length
<107 bp to perform a comprehensive analysis of the
relationship between deletion length and non-B DNA-forming sequences, GC
content, specific sequence motifs, and microhomologies.
To our knowledge, this is the first study to demonstrate that very short
deletions (\(\leq\)8 bp) have a low probability of co-occurrence with
non-B DNA-forming repeats. However, when the deletion length is
>8 bp but \(\leq\)30 bp, the non-B DNA-forming repeat
frequency neighboring deletion breakpoints is significantly and
positively correlated with deletion length (Figure 3). By contrast, no
significant correlation was observed between deletion length and repeat
frequencies for deletions >30 bp, a finding that points to differences
between the mechanisms of formation of long deletions and those of
short deletions.
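The length-window comparison described above can be illustrated with a minimal sketch. It assumes that repeat frequencies have already been averaged per deletion length; the function names, the `freq_by_length` mapping, and the toy values are hypothetical and are not taken from the analysis reported here. Significance testing is omitted.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)

def windowed_pcc(freq_by_length, lo, hi):
    """PCC between deletion length and repeat frequency for lo < length <= hi."""
    lengths = [n for n in sorted(freq_by_length) if lo < n <= hi]
    freqs = [freq_by_length[n] for n in lengths]
    return pearson(lengths, freqs)

# Hypothetical toy data: frequency rises linearly from 9 to 30 bp,
# then fluctuates around a constant level for longer deletions.
toy = {n: 0.01 * n for n in range(9, 31)}
toy.update({n: 0.3 + (0.02 if n % 2 else -0.02) for n in range(31, 61)})

short_window = windowed_pcc(toy, 8, 30)   # strongly positive in the toy data
long_window = windowed_pcc(toy, 30, 60)   # close to zero in the toy data
```

With real data, the same two calls would be applied to the observed per-length frequencies, and a p-value would be computed alongside each coefficient.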
This study confirmed and extended previous observations that deletions
of all sizes tend to be concentrated in GC-rich regions of the genome.
Indeed, high GC content has been associated with a high level of
mutation in general, not just deletions (Abeysinghe et al., 2003; Albano
et al., 2010; Kiktev et al., 2018; Zheng et al., 2013). Furthermore, we
found that when deletion length was less than 38 bp, deletion length
and GC content were positively correlated; the correlation attained its
highest value (PCC=0.87, p=6.0E-10) when the deletion length was
\(\leq\)29 bp. A previous study found that increased GC content
contributes to the stabilization of non-B DNA structures, thereby
enhancing the propensity of deletions to occur (Tanay & Siggia, 2008).
This may partially explain our findings that deletion length was
positively correlated with both non-B DNA-forming motifs and GC content.
A recent study discovered that GC content is associated with both
increased and decreased mutation rates depending upon the nucleotide
motif (Carlson et al., 2018). Our previous analysis showed that the free
energy (∆G) of fold-back structures increases with increasing GC
content, and so does the number of SNPs (Abeysinghe et al., 2003; Cooper
et al., 2011). The underlying reason may be that the three hydrogen
bonds of G:C pairs lead to more stable hairpins, although the greater
flexibility of GC-rich sequences relative to AT-rich ones may also
contribute to this stability (Abeysinghe et al., 2003; Cooper et al., 2011).
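As an illustration of how GC content around a deletion might be measured, the following sketch computes the GC fraction of the sequence flanking a deletion. The coordinate convention (0-based, half-open), the default flank size, and the function names are assumptions made for this example and do not describe the pipeline used in this study.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) in seq."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    acgt = gc + s.count("A") + s.count("T")
    if acgt == 0:
        raise ValueError("sequence has no unambiguous bases")
    return gc / acgt

def flank_gc(reference, start, end, flank=100):
    """GC content of the two flanks of a deletion of reference[start:end].

    Coordinates are 0-based and half-open; `flank` is a hypothetical
    window size (in bp) taken on each side of the deleted segment.
    """
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return gc_content(left + right)

# Toy reference: the deleted segment is reference[6:10]
toy_reference = "AAAAGGGGCCCCTTTT"
narrow = flank_gc(toy_reference, 6, 10, flank=2)  # flanks "GG" and "CC"
wide = flank_gc(toy_reference, 6, 10, flank=4)    # flanks "AAGG" and "CCTT"
```

Per-deletion values obtained in this way could then be averaged by deletion length before correlating them with length.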
Previous studies have reported the involvement of a number of different
sequence motifs in the DNA breakage events leading to microdeletions and
microinsertions (Ball et al., 2005). Several studies have examined
sequence motifs in the vicinity of the breakpoints of large genomic
rearrangements, including large deletions (Abeysinghe et
al., 2003; Dittwald et al., 2013; Férec et al., 2006; Hillmer et al.,
2017; Jahic et al., 2017; Visser, Shimokawa, Harada, Kinoshita, et al.,
2005; J. Vogt et al., 2014). Here we collected a large number of
inherited pathogenic deletions, representing a continuum of lengths from
1 bp to 28,394,429 bp, and determined the frequency of occurrence of 78
sequence motifs known to be over- or under-represented in the vicinity
of breakpoints or sites of gene conversion in the human
genome (Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et al.,
2009). We found that the sequence motif frequency was significantly and
negatively (PCC=-0.62, p=3.2E-2) correlated with deletion length when
deletions were \(\leq\)12 bp. However, the relationship between motif
frequency and deletion length may well be dependent upon the type of
motif in question. As shown in Figures S7-S12, the motif frequencies are
distributed quite differently in the vicinity of the deletion
breakpoints; thus, further studies are required to identify the
underlying reasons responsible for the relationship between deletions
and the frequencies of specific motifs.
Here we observed that non-B DNA-forming sequences such as direct repeats
(DR), inverted repeats (IR), and short tandem repeats (STR) were less
abundant at the breakpoints and in breakpoint-flanking
regions of deletions ≤30 bp than of deletions >30 bp
(Figure 4). These repeats may form non-B DNA structures that cause
replication stalling followed by replication fork repriming downstream,
thereby leading to deletions, a mechanism described as Fork Stalling
and Template Switching (FoSTeS) (Lee et al., 2007). Replication errors
mediated by these repeats may more frequently cause deletions
>30 bp than deletions ≤30 bp in length. In particular,
direct repeats were overrepresented immediately at the breakpoints of
deletions >30 bp (Figure 4A), indicative of a specific role
for these repeats in deletion formation. Direct repeats may form slipped
structures if they are base-paired with the complementary strand in a
misaligned fashion, causing hairpins or looped-out bases which may cause
replication slippage (Zhao et al., 2010). By contrast, G-quadruplex
(GQ)-forming repeats were not overrepresented at the breakpoints of
deletions >30 bp (Figure 4B). However, the frequency of
GQ-forming repeats was increased in regions flanking the breakpoints of
deletions ≤30 bp. The highest frequency of these repeats was observed in
regions ~150 bp flanking the breakpoints on both sides
(Figure 4B), suggesting the involvement of stem-loop formations and
microhomology-mediated break-induced replication (MMBIR) in the deletion
process.
In addition to MMBIR, microhomology-mediated end joining (MMEJ) plays an
important role in double-strand break repair and causes pathogenic deletion
and translocation variants in the human genome (McVey & Lee, 2008;
Verdin et al., 2013). MMEJ repairs DNA breaks via the use of substantial
microhomology and creates precise deletions without insertions or other
mutations at the breakpoint. We identified microhomologies within the
breakpoint-flanking regions of 60% of the HGMD deletions, indicating
that MMEJ is an important mechanism underlying pathogenic deletions in
humans. This is in accord with the findings of Grajcarek et
al. (2019), who identified microhomologies at the
breakpoints of 57% of the deletions included in ClinVar. Additionally,
we found that more than 42% of the breakpoint-flanking regions of
short deletions (<30 bp) harbor microhomologies, a somewhat higher
proportion than for long deletions (29%). This is the first
investigation to compare the occurrence of microhomologies in short
and long deletions.
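One common way to score microhomology at a deletion junction is sketched below. It counts the bases that are identical between the deleted segment and the retained sequence on the opposite side of the junction, in both directions. This is an illustrative definition with hypothetical names and 0-based, half-open coordinates; the criteria used to call microhomologies in this study may differ.

```python
def microhomology_length(reference, start, end):
    """Microhomology length for a deletion of reference[start:end].

    Coordinates are 0-based and half-open. The result is the number of
    positions by which the deletion could be shifted without changing
    the resulting sequence, capped at the deletion length.
    """
    if not (0 <= start < end <= len(reference)):
        raise ValueError("invalid deletion coordinates")
    ref = reference.upper()
    del_len = end - start

    # Forward: start of the deleted segment vs. sequence right after it
    fwd = 0
    while (fwd < del_len and end + fwd < len(ref)
           and ref[start + fwd] == ref[end + fwd]):
        fwd += 1

    # Backward: end of the deleted segment vs. sequence right before it
    bwd = 0
    while (fwd + bwd < del_len and start - 1 - bwd >= 0
           and ref[end - 1 - bwd] == ref[start - 1 - bwd]):
        bwd += 1

    return fwd + bwd

# Toy example: "ACGT" occurs at the start of the deleted segment
# and again immediately after it.
toy_reference = "TTACGTGGGGACGTCC"
mh_a = microhomology_length(toy_reference, 2, 10)
# The same deletion reported with shifted coordinates gives the same score.
mh_b = microhomology_length(toy_reference, 4, 12)
```

Because both directions are counted, the score does not depend on where within the microhomology the breakpoint happens to be annotated.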
It is well known that replication-based mechanisms are often involved in
the formation of deletions and duplications of various sizes (Ankala et
al., 2012; Geng et al., 2021; Seo et al., 2020; Vissers et al., 2009;
Zhao et al., 2010). Our findings suggest that these mechanisms also
contribute to the formation of pathogenic microdeletions <30
bp and gross deletions ≥30 bp. However, the different frequencies and
distribution profiles of non-B DNA-forming sequence motifs at the
breakpoints and within breakpoint-flanking regions of both groups of
deletions suggest that the replication errors underlying the deletions
are induced by different types of non-B DNA structure.
Overall, this study suggests 25-30 bp as a potential threshold that can
be used to distinguish gross deletions from microdeletions in terms of
their likely underlying mechanisms of mutagenesis. This notional
threshold is based on the observed correlations between
deletion length, non-B DNA-forming repeat frequencies, GC content, and
sequence motif frequencies (Figure 7A). For deletion lengths greater
than 30 bp, correlations start to weaken, and they tend to disappear at
lengths greater than 50 bp. Although establishing a threshold to
distinguish gross deletions from microdeletions is to some extent
dependent on the intended research purpose, there is value in being able
to draw distinctions based upon objective analyses. The approach and
results reported here provide a path that should allow us to move away
from arbitrary dividing lines and arrive at information-based knowledge
concerning the rather different generative mechanisms underlying
microdeletions and gross deletions.