Background
Deletions are responsible for many human genetic diseases and together
constitute about 20% of all mutations known to cause human inherited
disease(Stenson et al., 2020). Deletions are associated not only with
common disorders, such as Alzheimer’s disease(Cukier et al., 2016;
Prihar et al., 1999), Parkinson’s disease(Tan, 2016), intellectual
disability(Sharp et al., 2006), autistic spectrum disorders(Sato et al.,
2012; Vaags et al., 2012), and heritable cancers(Guo et al., 2018; Xu et
al., 2012) but also with rare or low-frequency diseases(Nambot et al., 2018).
Disease-associated deletions in humans may range in length from 1 base
pair (bp) to many thousands or even millions of bp. Historically,
the Human Gene Mutation Database (HGMD) has subdivided genomic deletions
into microdeletions (1-20 bp) and gross deletions (>20
bp)(Stenson et al., 2020), but this distinction was originally made
fairly arbitrarily for reasons of practical utility rather than for any
cogent biological reason. Many studies(C. M. Carvalho & J. R.
Lupski, 2016; Keute et al., 2020; Maranchie et al., 2004; Sahoo et al.,
2006) have suggested the involvement of different mechanisms in the
formation of microdeletions and gross deletions including non-homologous
end-joining (NHEJ), microhomology-mediated end-joining (MMEJ),
non-allelic homologous recombination (NAHR), retrotransposon-mediated
mechanisms, and replication-based errors including fork stalling and
template switching (FoSTeS) and microhomology-mediated break-induced
replication (MMBIR)(Abelleyro et al., 2020; Bauters et al., 2008;
Carvalho et al., 2009; Férec et al., 2006; Gadgil et al., 2020; P.
Hastings, Ira, & Lupski, 2009; P. J. Hastings, Lupski, Rosenberg, &
Ira, 2009; Hu et al., 2019; Lee, Carvalho, & Lupski, 2007; Marey et
al., 2016; Summerer et al., 2018; J. Vogt et al., 2014; Zhang et al.,
2009; Zhang et al., 2010). Jahic et al.(Jahic et al., 2017) have
presented doublet-mediated DNA rearrangements as a mechanism for the
formation of recurrent pathogenic deletions of exon 10 in the SPAST gene.
These different mutational mechanisms may be inferred from the presence
of different breakpoint sequence features(Kidd et al.,
2010).
Both gross deletions and microdeletions are non-randomly distributed in
the human genome and are known to be strongly influenced by the local
DNA sequence environment(Del Mundo, Zewail-Foote, Kerwin, & Vasquez,
2017; Georgakopoulos-Soares, Morganella, Jain, Hemberg, & Nik-Zainal,
2018). Previous studies have found that both gross deletions and
microdeletions originate through the formation and resolution of
aberrant DNA secondary structures, and we now know that the process of
secondary structure formation is strongly sequence-mediated(Férec et
al., 2006; Kouzine et al., 2017; Krawczak & Cooper, 1991; Wu et al.,
2014). Other studies have found that the breakpoints of deletions
often possess a significant number of identical nucleotides, indicating
the involvement of direct repeats(Kato et al., 2008), while replication
slippage is recognized as a common cause of microdeletions(MacLean,
Favaloro, Warne, & Zajac, 2006). More recent studies have revealed that
replication-based mechanisms are frequently involved in gross
duplications and deletions(Ankala et al., 2012; C. M. Carvalho & J. R.
Lupski, 2016; Geng et al., 2021; Marey et al., 2016; Seo et al., 2020).
Analyzing 8,399 microdeletions in 940 genes from HGMD, one early study
found that 81% of microdeletions (<21 bp) were located in the
vicinity of direct, inverted, or mirror repeats(Ball et al., 2005).
Another study attempted to relate the occurrence of microdeletions to
the presence of non-B DNA structures by employing a set of 17,208
microdeletions (defined as being of length <21 bp), and found
that 56% of microdeletions harbored either direct repeats or mirror
repeats near the breakpoints(Kamat, Bacolla, Cooper, & Chuzhanova,
2016). An analysis of 11 gross deletions associated with autosomal
dominant polycystic kidney disease, early-onset Parkinsonism, Menkes
disease, α⁺-thalassemia, adrenoleukodystrophy, and
hydrocephalus concluded that these large deletions were
mediated by negative supercoiling-dependent non-B DNA
conformations(Bacolla et al., 2004). Sequence motifs capable of forming
non-B DNA structures contribute to the genome-wide instability
responsible for both small- and large-scale copy number variants(Brown
& Freudenreich, 2021; Guiblet et al., 2021). Arlt et al.(Arlt et al.,
2009) reported that replication stress induces genome-wide copy number
changes resembling pathogenic deletions and duplications. In one study,
most deletion breakpoint junctions were characterized by
microhomologies, suggesting that they were formed by NHEJ or a
replication-coupled process(Seo et al., 2020).
Marey et al.(Marey et al., 2016) illustrated the important role of NHEJ
in the formation of DMD gene deletions.
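
As a rough illustration of how junction microhomology of the kind described above might be measured, the following Python sketch counts the bases over which a deletion could be shifted without changing the resulting sequence. The function name `microhomology_length` and the 0-based, half-open coordinate convention are assumptions made for this example only; they are not taken from any of the cited studies.

```python
def microhomology_length(ref: str, start: int, end: int) -> int:
    """Return the microhomology length at a deletion junction.

    The deleted segment is ref[start:end] (0-based, half-open; an
    assumption of this sketch). Microhomology is counted as the number
    of positions by which the deletion can be shifted to the right or
    to the left while still producing the same deleted allele.
    """
    if not (0 <= start < end <= len(ref)):
        raise ValueError("deletion coordinates fall outside the reference")
    ref = ref.upper()

    # Rightward: the first bases of the deleted segment match the bases
    # immediately downstream of the deletion.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1

    # Leftward: the last bases of the deleted segment match the bases
    # immediately upstream of the deletion.
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1

    return right + left


if __name__ == "__main__":
    # Deleting "CTGAAT" from GGA|CTGAAT|CTGCC leaves a "CTG" microhomology.
    print(microhomology_length("GGACTGAATCTGCC", 3, 9))
```

Because the count is invariant to where the deletion is placed within the homologous stretch, left-aligned and right-aligned calls of the same event give the same value. Within tandem repeats the count can exceed the deletion length, which may be more suggestive of replication slippage than of end-joining.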
Different forms of sequence capable of forming non-B DNA structures
predispose certain genomic regions to instability causing pathogenic
rearrangements (Zhao, Bacolla, Wang, & Vasquez, 2010). The relationship
between deletions and non-B DNA structures has been investigated in
terms of the molecular properties of the deletion breakpoints (the
breakpoints being defined as the junctions between the normal and
rearranged DNA sequences)(Bacolla, Wojciechowska, Kosmider, Larson, &
Wells, 2006; Damas, Carneiro, Amorim, & Pereira, 2014; Keegan, Wilton,
& Fletcher, 2019). Verdin et al. identified various genomic
architectural features, including sequence motifs, putative sites of
non-B DNA conformations, and repetitive elements in breakpoint
regions(Verdin et al., 2013). Recurrent gross chromosomal
rearrangements, including large deletions of several hundred kb, are
mediated by NAHR(Demaerel et al.,
2019; Dittwald et al., 2013; Harel & Lupski, 2018; Hillmer et al.,
2016; Inoue & Lupski, 2002; Liu, Carvalho, Hastings, & Lupski, 2012;
P. H. Vogt et al., 2021). More recently, 8,943 non-pathogenic deletion
breakpoints from 1,092 healthy humans were analyzed, revealing that
NAHR-mediated breakpoints are associated with open chromatin(Abyzov et
al., 2015). To our knowledge, however, no study has been performed that
systematically explores the range of structural features associated
with, and the mechanisms underlying, the full spectrum of human
pathogenic gene deletions of different lengths, extending from the
smallest of microdeletions to gross deletions. Such a study is needed to
determine how microdeletions differ from gross deletions in terms of
their underlying generative mechanisms, and whether there is a natural
threshold or cut-off between these two entities or whether they simply
form opposite ends of a continuum.
Besides a relationship between non-B DNA structure-forming motifs and
deletion mutagenesis, several studies show that increasing GC content is
associated with elevated rates of mutation and recombination(Kiktev,
Sheng, Lobachev, & Petes, 2018; Romiguier, Ranwez, Douzery, & Galtier,
2010). Deletion rates also vary between species in relation to genomic
GC content(Hardison et al., 2003; Lindsay, Rahbari, Kaplanis, Keane, &
Hurles, 2019). A study of eutherian genomes found that increased GC
content was associated with an increase in germline deletion
frequency(Hardison et al., 2003). In a similar vein, an analysis of 33
mammalian genomes found that GC-rich sequences were prone to
deletion(Romiguier et al., 2010). These discoveries have indicated the
importance of GC content in the formation of deletions in several
different contexts. However, all these studies have either been
inter-species comparisons or intra-genome comparisons in healthy humans
and did not investigate pathogenic deletions. Importantly, to our
knowledge, no study has yet investigated the relationship between GC
content and deletion length in a disease context. Thus, here we formally
investigate the relationship between GC content and pathogenic deletion
length.
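
For concreteness, GC content in the neighborhood of a breakpoint could be computed along the following lines. This is a minimal Python sketch: the function names and the default flank of 50 bp are hypothetical choices for illustration and do not describe the window size used in any particular study.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def flanking_gc(ref: str, breakpoint: int, flank: int = 50) -> float:
    """GC content of the window [breakpoint - flank, breakpoint + flank).

    The window is clipped at the ends of the reference sequence. The
    flank size is an illustrative assumption.
    """
    lo = max(0, breakpoint - flank)
    hi = min(len(ref), breakpoint + flank)
    return gc_content(ref[lo:hi])


if __name__ == "__main__":
    print(flanking_gc("AAAAGGGGCCCCTTTT", breakpoint=8, flank=4))
```

Ambiguous bases such as N are excluded from the denominator, so that unsequenced stretches do not artificially lower the estimate.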
Various sequence motifs have been reported to be over-represented in the
vicinity of microdeletion breakpoints(Ball et al., 2005). Similarly,
purine-pyrimidine sequences and polypurine tracts are significantly
enriched in the vicinity of gross gene deletions(Abeysinghe, Chuzhanova,
Krawczak, Ball, & Cooper, 2003). A recurrent 1.11-Mb deletion in
14q32.2 is catalyzed by large (TGG)n tandem repeats(Béna et al., 2010).
One study reporting the sequencing of the breakpoint junctions of 30
rare deletions spanning between 91 bp and 14 kb found that most
breakpoints exhibited microhomologies and were associated with specific
sequence motifs(Vissers et al., 2009). Currently, we estimate that at
least 78 sequence motifs have been found to occur at elevated
frequencies in the vicinity of deletion, recombination, or translocation
breakpoints(Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et
al., 2009). Ball et al.(Ball et al., 2005) reported 30 motifs, including
the heptanucleotide CCCCCTG, DNA polymerase pause sites, and
topoisomerase cleavage sites that occurred frequently near deletion
breakpoints. Chuzhanova et al.(Chuzhanova et al., 2009) presented DNA
sequence motifs known to be associated with site-specific
cleavage/recombination and gene mutations, as well as various
“super-hotspot motifs” that were over-represented in the vicinity of
microdeletions.
However, to our knowledge, no attempt has as yet been made to analyze a
large set of pathogenic deletions, including both microdeletions and
gross deletions, in order to systematically explore the relationship
between deletion length and occurrence frequency for the different types
of sequence motif residing in the vicinity of breakpoints.
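
The occurrence frequency of a motif near a breakpoint can be sketched in Python as follows. The example uses the heptanucleotide CCCCCTG mentioned above; the function names, the choice to search both strands, the counting of overlapping matches, and the 50-bp flank are all assumptions of this illustration rather than a description of a published procedure.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    i = seq.find(motif)
    while i != -1:
        count += 1
        i = seq.find(motif, i + 1)
    return count


def motif_count_near_breakpoint(ref: str, breakpoint: int, motif: str,
                                flank: int = 50) -> int:
    """Count motif hits on either strand within +/- flank of a breakpoint.

    The window is clipped at the ends of the reference. Palindromic
    motifs are counted once rather than once per strand.
    """
    lo = max(0, breakpoint - flank)
    hi = min(len(ref), breakpoint + flank)
    window = ref[lo:hi].upper()
    motif = motif.upper()
    rc = reverse_complement(motif)
    hits = count_overlapping(window, motif)
    if rc != motif:
        hits += count_overlapping(window, rc)
    return hits


if __name__ == "__main__":
    ref = "AT" + "CCCCCTG" + "ATAT" + "CAGGGGG" + "TA"
    print(motif_count_near_breakpoint(ref, breakpoint=11, motif="CCCCCTG"))
```

Comparing such counts between observed breakpoints and matched control positions is one way in which over-representation could then be assessed.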
Here, we have performed an analysis of pathogenic gene deletions on two
originally distinct microdeletion and gross deletion datasets from the
Human Gene Mutation Database. Together, these
comprise 42,098 breakpoints in a total of 3,685 genes. We used simulated
“deletions” matched by length and genomic position as controls. The
purpose of this analysis was to assess the combined datasets in terms of
the frequencies of six types of non-B DNA-forming repeat, GC content,
the frequencies of specific sequence motifs, and microhomologies
neighboring the breakpoints. We propose several possible mechanisms for
the formation of microdeletions and gross deletions. In addition, we
compare generative mechanisms of microdeletions and gross deletions and
suggest a new working definition with which to discriminate between
microdeletions and gross deletions in terms of their size and underlying
mechanisms of formation.
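
To illustrate what a length- and position-matched control might look like, the Python sketch below draws a simulated "deletion" of the same length as an observed one, at a random start position near the original. The details of the matching procedure are not specified in this section, so the `max_shift` parameter, the uniform sampling, and the function name are assumptions made purely for illustration.

```python
import random
from typing import Optional, Tuple


def simulate_matched_control(start: int, length: int, seq_len: int,
                             max_shift: int = 1000,
                             rng: Optional[random.Random] = None
                             ) -> Tuple[int, int]:
    """Draw a control interval of the same length near an observed deletion.

    The observed deletion occupies [start, start + length) on a sequence
    of length seq_len (0-based, half-open). The control start is sampled
    uniformly within max_shift bp of the observed start (an assumed
    matching rule), excluding the observed start itself.
    """
    if length <= 0 or start < 0 or start + length > seq_len:
        raise ValueError("observed deletion does not fit on the sequence")
    rng = rng or random.Random()

    lo = max(0, start - max_shift)
    hi = min(seq_len - length, start + max_shift)
    if hi <= lo:
        raise ValueError("no alternative position available for a control")

    # Resample until the control differs from the observed position.
    while True:
        control_start = rng.randint(lo, hi)
        if control_start != start:
            return control_start, control_start + length


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    print(simulate_matched_control(5000, 120, 100_000, rng=rng))
```

Keeping the control close to the observed event is intended to hold the regional sequence context broadly constant, so that differences in breakpoint features are less likely to reflect large-scale genomic heterogeneity.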