Specific sequence motifs in deletion flanking sequences
From previous publications (Abeysinghe et al., 2003; Ball et al., 2005; Chuzhanova et al., 2009), we collected a total of 78 sequence motifs (Table S4) that have been reported to occur in the vicinity of deletion/rearrangement breakpoints and are thought to play a role in the breakage and rejoining of DNA molecules. Briefly, Abeysinghe et al. (2003) listed 36 sequence motifs known to be associated with site-specific recombination, mutation, and DNA cleavage. In their later study, Ball et al. (2005) collected an additional 24 sequence motifs thought to be involved in site-specific recombination and putative deletion/insertion hotspots. Finally, Chuzhanova et al. (2009) reported 18 further motifs associated with deletions and recombination. We computed the frequency of each motif in the 1 kb-long sequences flanking the pathogenic deletions from the HGMD-deletion dataset and in the control0 dataset using the R package Biostrings (Gentleman & DebRoy, 2019). We then used the simulated deletions to determine whether the number of occurrences of any motif in the vicinity of each breakpoint was higher than expected. To this end, we computed an “experience hit” (eH-value), i.e., the number of times the motif count in the vicinity of the simulated breakpoints of the control dataset exceeded the motif count in the vicinity of the pathogenic deletion breakpoints, divided by 100. The relationship between deletion length and motif frequency was then explored by calculating the average motif frequency for each deletion length.
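The counting and eH-value steps described above can be illustrated with a minimal Python sketch. It is not the Biostrings-based pipeline used in the study: the function names (count_motif, eh_value, mean_frequency_by_length) are hypothetical, motifs are assumed to be written in IUPAC nucleotide codes, overlapping occurrences are assumed to be counted, and only the given strand is scanned. The eH-value is divided by the number of simulated counts supplied, which reproduces the division by 100 when 100 simulated values are passed.

```python
import re
from collections import defaultdict

# IUPAC nucleotide codes expanded to regex character classes
# (assumption: motifs are written with these codes).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_motif(motif, sequence):
    """Count occurrences of a motif in a sequence, overlaps included.

    Only the strand passed in is scanned (illustrative choice).
    """
    body = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead matches at every start position,
    # so overlapping occurrences are each counted once.
    pattern = re.compile("(?=" + body + ")")
    return sum(1 for _ in pattern.finditer(sequence.upper()))


def eh_value(observed_count, simulated_counts):
    """Fraction of simulated counts strictly larger than the observed count."""
    if not simulated_counts:
        raise ValueError("at least one simulated count is required")
    exceed = sum(1 for c in simulated_counts if c > observed_count)
    return exceed / len(simulated_counts)


def mean_frequency_by_length(records):
    """Average motif count per deletion length.

    records: iterable of (deletion_length, motif_count) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # length -> [sum, n]
    for length, count in records:
        totals[length][0] += count
        totals[length][1] += 1
    return {length: s / n for length, (s, n) in totals.items()}
```

In use, count_motif would be applied to each 1 kb flank of a pathogenic deletion to obtain the observed count and to the flanks of its simulated counterparts to obtain the simulated counts; a small eH-value then indicates that the simulated breakpoints rarely carry more copies of the motif than the pathogenic one.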