Mutation and control datasets
In December 2019, the HGMD (Stenson et al., 2020; Stenson et al., 2014) Professional release 2019.4 [http://www.hgmd.org] contained 38,725 microdeletions of ≤20 bp and 3,373 gross (>20 bp) deletions, all characterized at base-pair resolution and together constituting, at that time, about 20% of all sequence-characterized mutations causing human inherited disease. These two deletion datasets were collected from the primary literature in precisely the same way; the 20 bp cut-off historically employed between microdeletions and gross deletions was entirely arbitrary and did not influence collation efficiency in any way. For the purposes of this study, these datasets were merged and together termed the ‘HGMD-deletion dataset’, which comprised 42,098 deletions in total. Of these deletions, 40,037 (95.1%) have a length ≤106 bp, whilst 2,061 (4.9%) have a length between 107 and 28,394,429 bp. Figure S15 displays the log-transformed number of deletions at each deletion length (for lengths <107 bp). Supplementary Table S6 lists the number of deletions of each specific length.
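The merging and length tally described above can be illustrated with a minimal sketch. The function names are hypothetical, the inputs are plain lists of deletion lengths in bp, and the use of log10 is an assumption, since the base of the logarithm in Figure S15 is not stated here.

```python
import math
from collections import Counter


def merge_deletions(micro_lengths, gross_lengths):
    """Pool microdeletion and gross-deletion lengths; the 20 bp split is not retained."""
    return list(micro_lengths) + list(gross_lengths)


def length_histogram(lengths, max_len=106):
    """Map each deletion length <= max_len to (count, log10(count)).

    log10 is an assumption made for illustration only.
    """
    counts = Counter(length for length in lengths if length <= max_len)
    return {length: (n, math.log10(n)) for length, n in sorted(counts.items())}


def fraction_at_most(lengths, max_len=106):
    """Fraction of deletions whose length does not exceed max_len."""
    return sum(1 for length in lengths if length <= max_len) / len(lengths)
```

Applied to the counts reported above, the same arithmetic gives 38,725 + 3,373 = 42,098 deletions, of which 40,037 / 42,098 (95.1%) are ≤106 bp.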
In order to assess the non-randomness of the HGMD-deletion dataset, we generated 100 simulated breakpoints for each deletion; these were randomly sampled within 3,000 bp upstream of each pathogenic deletion breakpoint. This process yielded 4,209,800 random breakpoints for the HGMD-deletion dataset. Then, using the coordinates of the 100 simulated breakpoints, we generated random deletions that matched each pathogenic deletion in length. By centering a 1-kb bin on each simulated breakpoint, we extracted the sequence flanking the breakpoint and included it in the control0 dataset. In total, the control0 dataset includes 4,209,800 × 2 breakpoints and 4,209,800 × 2 flanking sequences. By randomly sampling 10 deletions for each pathogenic deletion from control0, we generated the simulated dataset, termed control1, which contained 420,980 deletions. If a simulated sequence contained undefined bases (N), it was excluded from the analysis, and new random breakpoints and flanking sequences were generated by resampling. The coordinates of the simulated sequences were retrieved from a genome sequence file (version hg19) downloaded from https://www.gencodegenes.org/human/. Supplementary Table S7 shows the coordinates of the control1 dataset.
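The control-generation procedure can be sketched as follows. This is an illustrative reading under stated assumptions, not the pipeline actually used. It assumes that the simulated 5' breakpoint is drawn uniformly from the 3,000 bp immediately upstream of the pathogenic 5' breakpoint, that the 3' breakpoint follows from the matched length (hence two breakpoints and two flanking sequences per simulated deletion), and that the chromosome is available as an in-memory string. All function and parameter names are hypothetical.

```python
import random

UPSTREAM_WINDOW = 3000      # bp upstream of the pathogenic 5' breakpoint
BIN_SIZE = 1000             # 1-kb bin centred on each simulated breakpoint
HALF_BIN = BIN_SIZE // 2


def flank(seq, breakpoint):
    """Return the 1-kb window centred on `breakpoint`, or None if it leaves the sequence."""
    lo, hi = breakpoint - HALF_BIN, breakpoint + HALF_BIN
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


def simulate_controls(seq, start, end, n=100, rng=None, max_tries=100_000):
    """Simulate n length-matched deletions upstream of the deletion [start, end)."""
    rng = rng or random.Random()
    length = end - start
    controls, tries = [], 0
    while len(controls) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not sample enough N-free controls")
        sim_start = start - rng.randint(1, UPSTREAM_WINDOW)
        sim_end = sim_start + length
        flanks = (flank(seq, sim_start), flank(seq, sim_end))
        # Resample when a window is out of range or contains undefined bases (N).
        if any(f is None or "N" in f.upper() for f in flanks):
            continue
        controls.append({"start": sim_start, "end": sim_end, "flanks": flanks})
    return controls


def subsample(controls, k=10, rng=None):
    """Draw k simulated deletions without replacement (control0 -> control1)."""
    rng = rng or random.Random()
    return rng.sample(controls, k)
```

With 42,098 pathogenic deletions, n = 100 gives the 4,209,800 simulated deletions of control0 and k = 10 gives the 420,980 deletions of control1.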