Discussion
Irrespective of whether we consider microdeletions or gross deletions, the mechanisms underlying pathogenic deletions appear to be strongly influenced by the local DNA sequence environment (Kondrashov & Rogozin, 2004; Krawczak & Cooper, 1991). The role of non-B DNA structures in the formation of cancer-associated deletions as well as deletions in the germline and in mitochondrial sequences has been appreciated for some time (Bacolla, Tainer, Vasquez, & Cooper, 2016; Bacolla, Ye, Ahmed, & Tainer, 2019; Damas et al., 2014; Dong et al., 2014; Fontana & Gahlon, 2020; Pabis, 2021; Svetec Miklenic & Svetec, 2021; Zhao et al., 2010). Such non-B DNA structures often have key regulatory functions in DNA replication and transcription but may also cause genomic instability (Lemmens, van Schendel, & Tijsterman, 2015; Zhao et al., 2010). Furthermore, many deletions in the human genome are mediated by retrotransposon repeat-dependent mechanisms (Fujimoto et al., 2021; Mendez-Dorantes, Tsai, Jahanshir, Lopezcolorado, & Stark, 2020; Morales et al., 2021; Vocke et al., 2021). Similarly, many studies have indicated a role for GC content and DNA motif sequences in the formation of microdeletions and gross deletions (Cooper, Ball, & Mort, 2010; Visser, Shimokawa, Harada, Niikawa, & Matsumoto, 2005). However, the role of these sequence features in the formation of deletions of different lengths has not yet been methodically examined by robust statistical analyses. Meanwhile, the somewhat arbitrary definitions traditionally employed to distinguish between microdeletions and gross deletions have become blurred. We, therefore, collected 42,098 pathogenic deletions that display a length continuum stretching from 1 to 28,394,429 bp, from which we used 40,037 deletions with length <107 bp to perform a comprehensive analysis of the relationship between deletion length and non-B DNA-forming sequences, GC content, specific sequence motifs, and microhomologies.
To our knowledge, this is the first study to demonstrate that very short deletions (≤8 bp) have a low probability of co-occurrence with non-B DNA-forming repeats. However, when the deletion length is >8 bp but ≤30 bp, the frequency of non-B DNA-forming repeats neighboring the deletion breakpoints is significantly and positively correlated with deletion length (Figure 3). By contrast, no significant correlation was observed between deletion length and repeat frequencies for deletions >30 bp, a finding that distinguishes the mechanisms of formation of long deletions from those of short deletions.
This study confirmed and extended previous observations that deletions of all sizes tend to be concentrated in GC-rich regions of the genome. Indeed, high GC content has been associated with a high level of mutation in general, not just deletions (Abeysinghe et al., 2003; Albano et al., 2010; Kiktev et al., 2018; Zheng et al., 2013). Furthermore, we found that when deletion length was less than 38 bp, the deletion length and GC content were positively correlated; the correlation attained its highest value (PCC=0.87, p=6.0E-10) when the deletion length was ≤29 bp. A previous study found that increased GC content contributes to the stabilization of non-B DNA structures, thereby enhancing the propensity of deletions to occur (Tanay & Siggia, 2008). This may partially explain our findings that deletion length was positively correlated with both non-B DNA-forming motifs and GC content. A recent study discovered that GC content is associated with both increased and decreased mutation rates depending upon the nucleotide motif (Carlson et al., 2018). Our previous analysis showed that the free energy (∆G) of fold-back structures increases with increasing GC content, and so does the number of SNPs (Abeysinghe et al., 2003; Cooper et al., 2011). The underlying reason may be that the three hydrogen bonds of G:C pairs lead to more stable hairpins, although since GC-rich sequences are also more flexible than AT-rich ones, this may also contribute to relative stability (Abeysinghe et al., 2003; Cooper et al., 2011).
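The kind of length-versus-GC correlation discussed above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each deletion is available as a (length, flanking sequence) pair and that the Pearson correlation coefficient is computed between deletion length and the mean flanking GC content at each length, up to a chosen length cutoff. The function names and the input layout are hypothetical and are not taken from the analysis pipeline used in this study.

```python
from collections import defaultdict
from math import sqrt


def gc_content(seq):
    """Fraction of G or C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def length_gc_correlation(deletions, max_length):
    """Correlate deletion length with mean flanking GC content.

    `deletions` is an iterable of (length_bp, flanking_sequence) pairs
    (a hypothetical input layout). Only deletions with
    length <= max_length contribute.
    """
    by_length = defaultdict(list)
    for length, flank in deletions:
        if length <= max_length:
            by_length[length].append(gc_content(flank))
    lengths = sorted(by_length)
    mean_gc = [sum(by_length[n]) / len(by_length[n]) for n in lengths]
    return pearson(lengths, mean_gc)
```

Calling `length_gc_correlation` repeatedly with increasing values of `max_length` would show how the coefficient changes as longer deletions are included, which is one simple way to look for the length at which a correlation peaks or fades.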
Previous studies have reported the involvement of a number of different sequence motifs in the DNA breakage events leading to microdeletions and microinsertions (Ball et al., 2005). Several studies have examined sequence motifs in the vicinity of the breakpoints of large genomic rearrangements, including large deletions (Abeysinghe et al., 2003; Dittwald et al., 2013; Férec et al., 2006; Hillmer et al., 2017; Jahic et al., 2017; Visser, Shimokawa, Harada, Kinoshita, et al., 2005; J. Vogt et al., 2014). Here we collected a large number of inherited pathogenic deletions, representing a continuum of lengths from 1 bp to 28,394,429 bp, and determined the frequency of occurrence of 78 sequence motifs known to be over- or under-represented in the vicinity of breakpoints or sites of gene conversion in the human genome (Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et al., 2009). We found that the sequence motif frequency was significantly and negatively (PCC=-0.62, p=3.2E-2) correlated with deletion length when deletions were ≤12 bp. However, the relationship between motif frequency and deletion length may well be dependent upon the type of motif in question. As shown in Figures S7-S12, the motif frequencies are distributed quite differently in the vicinity of the deletion breakpoints; thus, further studies are required to identify the underlying reasons responsible for the relationship between deletions and the frequencies of specific motifs.
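As an illustration of what a motif frequency in the vicinity of a breakpoint might look like in practice, the following sketch counts (possibly overlapping) occurrences of a set of motifs within a fixed window around a breakpoint and normalizes by the window length. The window size, the per-base normalization, the restriction to exact string matches, and the function names are all assumptions made for this example; they do not describe the procedure actually used here.

```python
def count_overlapping(seq, motif):
    """Count occurrences of `motif` in `seq`, allowing overlaps."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count


def motif_frequency(ref, breakpoint, motifs, flank=50):
    """Motif occurrences per base within +/- `flank` bp of a breakpoint.

    `ref` is the reference sequence, `breakpoint` a 0-based coordinate
    on it, and `motifs` an iterable of exact (non-degenerate) motif
    strings. All of these are hypothetical inputs for illustration.
    """
    lo = max(0, breakpoint - flank)
    hi = min(len(ref), breakpoint + flank)
    window = ref[lo:hi].upper()
    if not window:
        return 0.0
    total = sum(count_overlapping(window, m.upper()) for m in motifs)
    return total / len(window)
```

Computing this value separately for each motif, rather than summing over all motifs, would be one way to examine the motif-specific differences noted above.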
Here we observed that non-B DNA-forming sequences such as DR, IR, and STR were less abundant at the breakpoints and in breakpoint flanking regions of deletions ≤30 bp than of deletions >30 bp (Figure 4). These repeats may form non-B DNA structures that cause replication stalling followed by replication fork repriming downstream, thereby leading to the deletions, a mechanism described as Fork Stalling and Template Switching (FoSTeS) (Lee et al., 2007). Replication errors mediated by these repeats may more frequently cause deletions >30 bp than deletions ≤30 bp in length. In particular, direct repeats were overrepresented immediately at the breakpoints of deletions >30 bp (Figure 4A), indicative of a specific role for these repeats in deletion formation. Direct repeats may form slipped structures if they are base-paired with the complementary strand in a misaligned fashion, causing hairpins or looped-out bases which may cause replication slippage (Zhao et al., 2010). By contrast, G-quadruplex (GQ)-forming repeats were not overrepresented at the breakpoints of deletions >30 bp (Figure 4B). However, the frequency of GQ-forming repeats was increased in regions flanking the breakpoints of deletions ≤30 bp. The highest frequency of these repeats was observed in regions ~150 bp flanking the breakpoints on both sides (Figure 4B), suggesting the involvement of stem-loop formations and microhomology-mediated break-induced replication (MMBIR) in the deletion process.
In addition to MMBIR, microhomology-mediated end joining (MMEJ) plays an important role in double-strand break repair and causes pathogenic deletion and translocation variants in the human genome (McVey & Lee, 2008; Verdin et al., 2013). MMEJ repairs DNA breaks via the use of substantial microhomology and creates precise deletions without insertions or other mutations at the breakpoint. We identified microhomologies within the breakpoint-flanking regions of 60% of the HGMD deletions, indicating that MMEJ is an important mechanism underlying pathogenic deletions in humans. This is in accord with the findings of Grajcarek et al. (2019), who identified microhomologies at the breakpoints of 57% of the deletions included in ClinVar. Additionally, we found that more than 42% of the breakpoint-flanking regions of short deletions (<30 bp) have microhomologies, a somewhat higher proportion than for long deletions (29%). This is the first investigation to compare the occurrence of microhomologies in short and long deletions.
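To make the notion of breakpoint microhomology concrete, the sketch below measures it as the number of positions by which a deletion can be shifted left or right along the reference without changing the resulting sequence. This equals the length of the identical sequence shared between the two breakpoints. The coordinate convention (0-based start of the deleted segment), the minimum length used to call a microhomology, and the function names are assumptions for this example rather than the definitions applied in this study.

```python
def microhomology_length(ref, start, length):
    """Length of the microhomology at a deletion junction.

    `ref` is the reference sequence, `start` the 0-based index of the
    first deleted base, and `length` the number of deleted bases.
    """
    end = start + length  # index of the first retained base after the deletion

    # Rightward: bases at the start of the deleted segment that match
    # the bases immediately downstream of the deletion.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1

    # Leftward: bases immediately upstream of the deletion that match
    # the bases at the end of the deleted segment.
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1

    return left + right


def fraction_with_microhomology(deletions, min_length=1):
    """Share of deletions with at least `min_length` bp of microhomology.

    `deletions` is a list of (ref, start, length) tuples; `min_length`
    is an illustrative threshold, not a value taken from this study.
    """
    if not deletions:
        return 0.0
    hits = sum(
        microhomology_length(ref, start, length) >= min_length
        for ref, start, length in deletions
    )
    return hits / len(deletions)
```

Because the measure counts shifts in both directions, it gives the same value regardless of where along the ambiguous stretch the deletion happens to be annotated.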
It is well known that replication-based mechanisms are often involved in the formation of deletions and duplications of various sizes (Ankala et al., 2012; Geng et al., 2021; Seo et al., 2020; Vissers et al., 2009; Zhao et al., 2010). Our findings suggest that these mechanisms also contribute to the formation of pathogenic microdeletions <30 bp and gross deletions ≥30 bp. However, the different frequencies and distribution profiles of non-B DNA-forming sequence motifs at the breakpoints and within breakpoint-flanking regions of both groups of deletions suggest that the replication errors underlying the deletions are induced by different types of non-B DNA structure.
Overall, this study suggests 25-30 bp as a potential threshold that can be used to distinguish gross deletions from microdeletions in terms of their likely underlying mechanisms of mutagenesis. This notional threshold is based on the observed correlations between deletion length and non-B DNA-forming repeat frequencies, GC content, and sequence motif frequencies (Figure 7A). For deletion lengths greater than 30 bp, correlations start to weaken, and they tend to disappear at lengths greater than 50 bp. Although establishing a threshold to distinguish gross deletions from microdeletions is to some extent dependent on the intended research purpose, there is value in being able to draw distinctions based upon objective analyses. The approach and results reported here provide a path that should allow us to move away from arbitrary dividing lines and arrive at information-based knowledge concerning the rather different generative mechanisms underlying microdeletions and gross deletions.