Mutation and control datasets
In December 2019, the HGMD (Stenson et al., 2020; Stenson et al., 2014)
Professional release 2019.4 [http://www.hgmd.org] contained 38,725
microdeletions of ≤20 bp and 3,373 gross (>20 bp) deletions, all
characterized at base-pair resolution, which at that time constituted
about 20% of all sequence-characterized mutations causing human
inherited disease.
These two deletion datasets were collected from the primary literature
in precisely the same way; the 20 bp cut-off employed historically
between microdeletions and gross deletions was entirely arbitrary and
did not influence collation efficiency in any way. For the purposes of
this study, these datasets were merged and together termed the
‘HGMD-deletion dataset’. In total, 42,098 deletions were included in the
HGMD-deletion dataset. Of these deletions, 40,037 (95.1%) have a length \(\leq\) 106 bp whilst 2,061 (4.9%) deletions have a length between 107
and 28,394,429 bp. Figure S15 plots the log-transformed number of
deletions against deletion length for deletions shorter than 107 bp.
Supplementary Table S6 lists the number of deletions of each length.
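The counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures stated in the text; the variable names and the `percent` helper are illustrative and are not part of the original analysis.

```python
# Counts quoted in the text (HGMD Professional release 2019.4).
MICRODELETIONS = 38_725   # deletions of <= 20 bp
GROSS_DELETIONS = 3_373   # deletions of > 20 bp
SHORT = 40_037            # deletions with length <= 106 bp
LONG = 2_061              # deletions with length between 107 and 28,394,429 bp

# Both partitions of the merged dataset must give the same total.
total = MICRODELETIONS + GROSS_DELETIONS
assert total == SHORT + LONG


def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


short_pct = percent(SHORT, total)
long_pct = percent(LONG, total)
print(total, short_pct, long_pct)  # 42098 95.1 4.9
```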
In order to assess the non-randomness of the HGMD-deletion dataset, we
generated 100 simulated breakpoints for each deletion; these were
randomly sampled within 3000 bp of the upstream region of each
pathogenic deletion breakpoint. This process yielded 4,209,800 random
breakpoints for the HGMD-deletion dataset. Then, according to the
coordinates of the 100 simulated breakpoints, we generated random
deletions that matched each pathogenic deletion in terms of its length.
By centering a 1-kb bin on each simulated breakpoint, we extracted the
sequence around the breakpoint and included it in the control0 dataset.
In total, the control0 dataset includes 4,209,800 \(\times\) 2 breakpoints and 4,209,800 \(\times\) 2 flanking
sequences. By randomly sampling 10 deletions for each pathogenic
deletion from control0, we generated the simulated dataset, termed
control1, which contained 420,980 deletions. If the simulated sequences
contained undefined bases (N), these sequences were excluded from the
analysis, and new random breakpoints and flanking sequences were
generated by resampling. The coordinates of the simulated sequences were
retrieved from a genome sequence file (version hg19) that was
downloaded from https://www.gencodegenes.org/human/.
Supplementary Table S7 shows the coordinates of the control1 dataset.
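The sampling scheme described above can be illustrated with a short sketch. This is not the code used in the study: it runs on a toy random sequence rather than hg19, and it makes several assumptions that the text does not specify, namely 0-based coordinates, a uniform draw of the 5' breakpoint from the 3000 bp upstream of the pathogenic 5' breakpoint, a 3' breakpoint placed at the 5' breakpoint plus the deletion length, and a retry cap. All function and variable names are hypothetical.

```python
import random

WINDOW = 3000   # simulated breakpoints lie within 3000 bp upstream (from the text)
FLANK = 500     # half of the 1-kb bin centred on each breakpoint (from the text)
N_SIM = 100     # simulated breakpoints per pathogenic deletion (from the text)
N_KEEP = 10     # deletions drawn per pathogenic deletion for control1 (from the text)


def flank(seq, pos):
    """Return the 1-kb sequence centred on pos, or None if it runs off the sequence."""
    lo, hi = pos - FLANK, pos + FLANK
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


def simulate_one(seq, start, length, rng, max_tries=10_000):
    """Draw one length-matched random deletion upstream of `start`.

    Draws whose flanking sequences contain undefined bases (N) are
    discarded and resampled, as described in the text. The retry cap
    is an assumption added so that the sketch cannot loop forever.
    """
    for _ in range(max_tries):
        sim_start = start - rng.randint(1, WINDOW)   # assumed uniform draw
        sim_end = sim_start + length                 # length-matched 3' breakpoint
        flanks = (flank(seq, sim_start), flank(seq, sim_end))
        if all(f is not None and "N" not in f for f in flanks):
            return sim_start, sim_end, flanks
    raise RuntimeError("no N-free control found within max_tries")


def build_controls(seq, deletions, seed=0):
    """Return (control0, control1) for a list of (start, length) deletions."""
    rng = random.Random(seed)
    control0, control1 = [], []
    for start, length in deletions:
        sims = [simulate_one(seq, start, length, rng) for _ in range(N_SIM)]
        control0.extend(sims)                        # 100 per pathogenic deletion
        control1.extend(rng.sample(sims, N_KEEP))    # 10 of those 100
    return control0, control1


# Toy example: a random 20-kb sequence with a short run of undefined bases.
toy_rng = random.Random(1)
toy = "".join(toy_rng.choice("ACGT") for _ in range(20_000))
toy = toy[:9_000] + "N" * 50 + toy[9_050:]
toy_deletions = [(10_000, 25), (12_000, 300)]        # hypothetical (start, length) pairs

control0, control1 = build_controls(toy, toy_deletions)
print(len(control0), len(control1))  # 200 20
```

With the 42,098 deletions of the HGMD-deletion dataset, the same proportions give the 4,209,800 simulated deletions of control0 and the 420,980 deletions of control1 reported above.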