Background
Deletions are responsible for many human genetic diseases and together constitute about 20% of all mutations known to cause human inherited disease(Stenson et al., 2020). Deletions are associated not only with common disorders, such as Alzheimer’s disease(Cukier et al., 2016; Prihar et al., 1999), Parkinson’s disease(Tan, 2016), intellectual disability(Sharp et al., 2006), autistic spectrum disorders(Sato et al., 2012; Vaags et al., 2012), and heritable cancers(Guo et al., 2018; Xu et al., 2012), but also with rare or low-frequency diseases(Nambot et al., 2018). Disease-associated deletions in humans range in length from 1 bp up to many thousands or even millions of base-pairs (bp). Historically, the Human Gene Mutation Database (HGMD) has subdivided genomic deletions into microdeletions (1-20 bp) and gross deletions (>20 bp)(Stenson et al., 2020), but this distinction was originally made fairly arbitrarily for reasons of practical utility rather than for any cogent biological reason. Many studies(Claudia MB Carvalho & James R Lupski, 2016; Keute et al., 2020; Maranchie et al., 2004; Sahoo et al., 2006) have suggested the involvement of different mechanisms in the formation of microdeletions and gross deletions, including non-homologous end-joining (NHEJ), microhomology-mediated end-joining (MMEJ), non-allelic homologous recombination (NAHR), retrotransposon-mediated mechanisms, and replication-based errors including fork stalling and template switching (FoSTeS) and microhomology-mediated break-induced replication (MMBIR)(Abelleyro et al., 2020; Bauters et al., 2008; Carvalho et al., 2009; Férec et al., 2006; Gadgil et al., 2020; P. Hastings, Ira, & Lupski, 2009; P. J. Hastings, Lupski, Rosenberg, & Ira, 2009; Hu et al., 2019; Lee, Carvalho, & Lupski, 2007; Marey et al., 2016; Summerer et al., 2018; J. Vogt et al., 2014; Zhang et al., 2009; Zhang et al., 2010).
Jahic et al. (Jahic et al., 2017) have presented doublet-mediated DNA rearrangements as a mechanism for the formation of recurrent pathogenic deletions of exon 10 in the SPAST gene. These different mutational mechanisms may be inferred from the presence of different breakpoint sequence features(Kidd et al., 2010).
Both gross deletions and microdeletions are non-randomly distributed in the human genome and are known to be strongly influenced by the local DNA sequence environment(Del Mundo, Zewail-Foote, Kerwin, & Vasquez, 2017; Georgakopoulos-Soares, Morganella, Jain, Hemberg, & Nik-Zainal, 2018). Previous studies have found that both gross deletions and microdeletions originate through the formation and resolution of aberrant DNA secondary structures, and we now know that the process of secondary structure formation is strongly sequence-mediated(Férec et al., 2006; Kouzine et al., 2017; Krawczak & Cooper, 1991; Wu et al., 2014). Deletion breakpoints have also been found to share a significant number of identical nucleotides, indicating the involvement of direct repeats(Kato et al., 2008), while replication slippage is recognized as a common cause of microdeletions(MacLean, Favaloro, Warne, & Zajac, 2006). More recent studies have revealed that replication-based mechanisms are frequently involved in gross duplications and deletions(Ankala et al., 2012; C. M. Carvalho & J. R. Lupski, 2016; Geng et al., 2021; Marey et al., 2016; Seo et al., 2020). Analyzing 8,399 microdeletions in 940 genes from HGMD, one early study found that 81% of microdeletions (<21 bp) were located in the vicinity of direct, inverted, or mirror repeats(Ball et al., 2005). Another study attempted to relate the occurrence of microdeletions to the presence of non-B DNA structures by employing a set of 17,208 microdeletions (defined as being of length <21 bp), and found that 56% of microdeletions harbored either direct repeats or mirror repeats near the breakpoints(Kamat, Bacolla, Cooper, & Chuzhanova, 2016).
An analysis of 11 gross deletions associated with autosomal dominant polycystic kidney disease, early-onset Parkinsonism, Menkes disease, \(\alpha^{+}\)-thalassemia, adrenoleukodystrophy, and hydrocephalus concluded that these large deletions were mediated by negative supercoiling-dependent non-B DNA conformations(Bacolla et al., 2004). Sequence motifs capable of forming non-B DNA structures contribute to the genome-wide instability responsible for both small- and large-scale copy number variants(Brown & Freudenreich, 2021; Guiblet et al., 2021). Arlt et al. (Arlt et al., 2009) reported that replication stress induces genome-wide copy number changes resembling pathogenic deletions and duplications. Most deletion breakpoint junctions were characterized by microhomologies, suggesting that they were formed by non-homologous end joining (NHEJ) or a replication-coupled process(Seo et al., 2020). Marey et al. (Marey et al., 2016) illustrated the important role of NHEJ in the formation of DMD gene deletions.
Different forms of sequence capable of forming non-B DNA structures predispose certain genomic regions to instability, causing pathogenic rearrangements(Zhao, Bacolla, Wang, & Vasquez, 2010). The relationship between deletions and non-B DNA structures has been investigated in terms of the molecular properties of the deletion breakpoints (the breakpoints being defined as the junctions between the normal and rearranged DNA sequences)(Bacolla, Wojciechowska, Kosmider, Larson, & Wells, 2006; Damas, Carneiro, Amorim, & Pereira, 2014; Keegan, Wilton, & Fletcher, 2019). Verdin et al. identified various genomic architectural features, including sequence motifs, putative sites of non-B DNA conformations, and repetitive elements in breakpoint regions(Verdin et al., 2013). Recurrent gross chromosomal rearrangements, including large deletions of several hundred kb, are mediated by non-allelic homologous recombination (NAHR)(Demaerel et al., 2019; Dittwald et al., 2013; Harel & Lupski, 2018; Hillmer et al., 2016; Inoue & Lupski, 2002; Liu, Carvalho, Hastings, & Lupski, 2012; P. H. Vogt et al., 2021). More recently, 8,943 non-pathogenic deletion breakpoints from 1,092 healthy humans were analyzed, revealing that NAHR-mediated breakpoints are associated with open chromatin(Abyzov et al., 2015). To our knowledge, however, no study has systematically explored the range of structural features associated with, and the mechanisms underlying, the full spectrum of human pathogenic gene deletions of different lengths, extending from the smallest of microdeletions to gross deletions. Such a study is needed to determine how microdeletions differ from gross deletions in terms of their underlying generative mechanisms, and whether there is a natural threshold or cut-off between these two entities or whether they simply form the opposite ends of a continuum.
Besides a relationship between non-B DNA structure-forming motifs and deletion mutagenesis, several studies show that increasing GC content is associated with elevated rates of mutation and recombination(Kiktev, Sheng, Lobachev, & Petes, 2018; Romiguier, Ranwez, Douzery, & Galtier, 2010). Deletion rates also vary between species in relation to genomic GC content(Hardison et al., 2003; Lindsay, Rahbari, Kaplanis, Keane, & Hurles, 2019). A study of eutherian genomes found that increased GC content was associated with an increase in germline deletion frequency(Hardison et al., 2003). In a similar vein, an analysis of 33 mammalian genomes found that GC-rich sequences were prone to deletion(Romiguier et al., 2010). These discoveries have indicated the importance of GC content in the formation of deletions in several different contexts. However, all these studies have either been inter-species comparisons or intra-genome comparisons in healthy humans and did not investigate pathogenic deletions. Importantly, to our knowledge, no study has yet investigated the relationship between GC content and deletion length in a disease context. Thus, here we formally investigate the relationship between GC content and pathogenic deletion length.
Various sequence motifs have been reported to be over-represented in the vicinity of microdeletion breakpoints(Ball et al., 2005). For example, purine-pyrimidine sequences and polypurine tracts are significantly enriched in the vicinity of gross gene deletions(Abeysinghe, Chuzhanova, Krawczak, Ball, & Cooper, 2003). A recurrent large deletion of 1.11 Mb in 14q32.2 is catalyzed by large (TGG)n tandem repeats(Béna et al., 2010). One study reporting the sequencing of the breakpoint junctions of 30 rare deletions spanning between 91 bp and 14 kb found that most breakpoints exhibited microhomologies and were associated with specific sequence motifs(Vissers et al., 2009). Currently, we estimate that at least 78 sequence motifs have been found to occur at elevated frequencies in the vicinity of deletion, recombination, or translocation breakpoints(Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et al., 2009). Ball et al. (Ball et al., 2005) reported 30 motifs, including the heptanucleotide CCCCCTG, DNA polymerase pause sites, and topoisomerase cleavage sites, that occurred frequently near deletion breakpoints. Chuzhanova et al. (Chuzhanova et al., 2009) presented DNA sequence motifs known to be associated with site-specific cleavage/recombination and gene mutations, as well as various “super-hotspot motifs” that were over-represented in the vicinity of microdeletions. However, to our knowledge, no attempt has as yet been made to analyze a large set of pathogenic deletions, including both microdeletions and gross deletions, in order to systematically explore the relationship between deletion length and occurrence frequency for the different types of sequence motif residing in the vicinity of breakpoints.
Here, we have performed an analysis of pathogenic gene deletions on two originally distinct microdeletion and gross deletion datasets from the Human Gene Mutation Database (HGMD). Together, these comprise 42,098 breakpoints in a total of 3,685 genes. We used simulated “deletions” matched by length and genomic position as controls. The purpose of this analysis was to assess the combined datasets in terms of the frequencies of six types of non-B DNA-forming repeat, GC content, the frequencies of specific sequence motifs, and microhomologies neighboring the breakpoints. We propose several possible mechanisms for the formation of microdeletions and gross deletions. In addition, we compare the generative mechanisms of microdeletions and gross deletions and suggest a new working definition with which to discriminate between the two in terms of their size and underlying mechanisms of formation.
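To make two of the breakpoint features named above concrete, the following is a minimal Python sketch of how flanking GC content and breakpoint microhomology could be measured for a single deletion. It is an illustration under stated assumptions, not the pipeline used in this study: the function names, the 0-based coordinate convention, the flank window size, and the toy reference sequence are all hypothetical. Microhomology is counted here as the number of positions by which the deletion can be shifted left or right without changing the resulting sequence, which is one common operational definition.

```python
def gc_content(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); NaN if none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / acgt


def flank_gc(ref, start, length, window):
    """GC content of the `window` bases on each side of a deletion.

    `start` is the 0-based index of the first deleted base (assumed convention).
    """
    end = start + length
    left = ref[max(0, start - window):start]
    right = ref[end:end + window]
    return gc_content(left + right)


def microhomology_length(ref, start, length):
    """Number of positions the deletion can be shifted with an identical product."""
    ref = ref.upper()
    end = start + length
    # Shift right: first deleted bases match the bases just after the deletion.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Shift left: last deleted bases match the bases just before the deletion.
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right


# Toy example (hypothetical sequence): delete "ACGTTT", leaving a 3-bp "ACG" microhomology.
toy_ref = "GGCACGTTTACGCCA"
toy_mh = microhomology_length(toy_ref, start=3, length=6)
toy_gc = flank_gc(toy_ref, start=3, length=6, window=3)
```

In this toy case the deleted segment begins with ACG and is immediately followed by another ACG, so the deletion could equally be placed at any of four positions; the shift count of 3 is the microhomology length. Within tandem repeats or homopolymer runs this count can exceed the deletion length, so a real analysis would likely need to cap or otherwise handle such cases.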