Bioinformatic prediction of pseudoexon activation
Currently, there is no bioinformatic tool dedicated to prediction of
pseudoexon-activating variants together with the corresponding size
and/or sequence of the inserted cryptic exon. The current prediction
strategy is to determine whether a deep intronic variant leads to a
de novo splice site gain, and then separately check for a nearby
pre-existing cryptic splice site of opposite polarity that could define
the boundary of the new exon (Caminsky et al., 2016; Lee et al., 2017).
In the variant prioritization method of Caminsky et al. (2016), an
Information Theory model was used to measure changes in
splicing-relevant protein binding sites and predict whether a variant
would lead to a gain or loss of a splicing motif. A total of 623
variants in hereditary breast and ovarian cancer genes were predicted to
create or strengthen an intronic cryptic splice site. However, only 17
variants were prioritized as likely to create a pseudoexon due to their
location within 250 nucleotides of another existing intronic site of
opposite polarity and the existence of an hnRNPA1 site within five
nucleotides of the acceptor of the predicted pseudoexon (Caminsky et
al., 2016). These prioritized variants have yet to undergo splicing
analysis, so the performance of the Information Theory model cannot yet
be assessed.
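The pairing step of this prioritization can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the `SpliceSite` class, the `candidate_pseudoexons` function, and the single-coordinate position convention are hypothetical, the 250-nucleotide limit is taken here as the distance between acceptor and donor positions, and the hnRNPA1 criterion is not modeled.

```python
from dataclasses import dataclass

# Distance limit taken from the prioritization criterion described above.
MAX_PAIR_DISTANCE = 250


@dataclass(frozen=True)
class SpliceSite:
    """Hypothetical record: one predicted splice site within an intron."""
    position: int  # coordinate along the transcript strand (assumed convention)
    kind: str      # "donor" or "acceptor"


def candidate_pseudoexons(new_site, existing_sites, max_distance=MAX_PAIR_DISTANCE):
    """Pair a de novo site with existing sites of opposite polarity.

    A pair is kept only if the acceptor lies upstream of the donor and the
    two positions are at most `max_distance` nucleotides apart.
    Returns a list of (acceptor_position, donor_position) tuples.
    """
    pairs = []
    for site in existing_sites:
        if site.kind == new_site.kind:
            continue  # same polarity cannot define the other exon boundary
        if new_site.kind == "acceptor":
            acceptor, donor = new_site, site
        else:
            acceptor, donor = site, new_site
        span = donor.position - acceptor.position
        if 0 < span <= max_distance:
            pairs.append((acceptor.position, donor.position))
    return pairs
```

For example, a de novo donor at position 1000 paired against existing acceptors at 700, 850, and 1100 would retain only the acceptor at 850: the one at 700 is too far upstream and the one at 1100 lies downstream of the donor.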
Another workflow incorporates CryptSplice, a tool that extends
the splice site definition of Burge et al. (1999) to capture more
sequence component information (Lee et al., 2017). The donor sequences
extend from seven nucleotides upstream of GT (−7) to six nucleotides
downstream of GT (+6), and acceptor sequences extend from 68 nucleotides
upstream of AG (−68) to 20 nucleotides downstream of AG (+20). This
extended definition was previously reported to improve splice site
prediction by combining the feature information of splicing signals and
SREs around splice sites (J. L. Li, Wang, Wang, Bai, & Yuan, 2012). In
an analysis of CFTR variants in cystic fibrosis patients with a
partly explained genetic cause for their recessively inherited disease,
intronic variants underwent prioritization to detect variants that may
lead to pseudoexon activation (Lee et al., 2017). Of 41 candidate
intronic variants predicted to create either donor or acceptor sequences
using CryptSplice, only three donor sequences were additionally
predicted to activate pseudoexons by manual evaluation of the
surrounding sequence for a splice site of opposite polarity (Lee et al.,
2017). Two of these variants were shown to cause pseudoexon insertion,
resulting in transcript loss due to nonsense-mediated decay; the third,
with a weakly predicted upstream acceptor, did not lead to aberrant
splicing.
In the same study, CryptSplice analysis of 4,685 unique DKC1
variants present in six individuals identified five candidate donor
sequences and 12 candidate acceptor sequences (Lee et al., 2017). Only
one of the five candidate donors was predicted to activate a pseudoexon.
Although mRNA analysis provided evidence for pseudoexonization, the
donor activated by this DKC1 variant paired not with the
CryptSplice-predicted acceptor but with another acceptor 14 nucleotides
upstream (Lee et al., 2017).
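To make the extended splice site definition concrete, the sketch below extracts the donor and acceptor windows described above from a raw sequence. It is only an illustration and is not CryptSplice itself. The function names are hypothetical, and the coordinate convention is an assumption: the window is taken as the stated number of nucleotides before the dinucleotide, the dinucleotide itself, and the stated number after it, giving 15 nucleotides for donors and 90 for acceptors. Scoring of the extracted windows is not shown.

```python
# Window extents from the extended definition described above.
DONOR_UPSTREAM, DONOR_DOWNSTREAM = 7, 6
ACCEPTOR_UPSTREAM, ACCEPTOR_DOWNSTREAM = 68, 20


def extract_window(seq, dinuc_start, upstream, downstream):
    """Return the window around the dinucleotide starting at `dinuc_start`.

    Returns None when the window would extend past either end of `seq`.
    """
    start = dinuc_start - upstream
    end = dinuc_start + 2 + downstream
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def candidate_windows(seq):
    """List (kind, dinucleotide_index, window) for every GT and AG in `seq`."""
    seq = seq.upper()
    windows = []
    for i in range(len(seq) - 1):
        dinuc = seq[i:i + 2]
        if dinuc == "GT":
            kind = "donor"
            window = extract_window(seq, i, DONOR_UPSTREAM, DONOR_DOWNSTREAM)
        elif dinuc == "AG":
            kind = "acceptor"
            window = extract_window(seq, i, ACCEPTOR_UPSTREAM, ACCEPTOR_DOWNSTREAM)
        else:
            continue
        if window is not None:
            windows.append((kind, i, window))
    return windows
```

Running the same extraction on the reference and variant sequences, and comparing the resulting windows, would identify the positions at which a variant alters a candidate donor or acceptor sequence.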
The Information Theory and CryptSplice prioritization methods for
pseudoexon-activating variants did not comprehensively take into account
the role of SREs, which can influence the expression of pseudoexons. To
illustrate, the Information Theory model predicted that MLH1
LRG_216t1:c.1559-1732A>T creates a new acceptor and
activates a 239-bp pseudoexon due to the presence of a downstream
pre-existing cryptic donor (Caminsky et al., 2016). However, our
analysis of the pseudoexon sequence using HSF revealed a cluster of
putative ESS octamers (X. H.-F. Zhang & Chasin, 2004) with high
relative activity, located within 30 nucleotides upstream of the
cryptic donor, that potentially inactivates this donor
(Supplementary Figure 2). Therefore, a prediction model that
incorporates both splice site motifs and the distribution of SREs within
candidate pseudoexons and their flanking regions is likely to improve
the accuracy of pseudoexon activation predictions.
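As a rough illustration of how SRE distribution could be added to such a model, the sketch below reports silencer motifs that fall within a fixed distance upstream of a candidate donor. It is not the HSF algorithm. The function name is hypothetical, the motif set must be supplied by the user (for example, a published list of ESS octamers), and the pseudoexon sequence is assumed to end at the last nucleotide before the donor GT.

```python
def ess_hits_upstream_of_donor(pseudoexon_seq, motifs, window=30):
    """Find motif occurrences lying fully inside the last `window` nt.

    `pseudoexon_seq` is assumed to end immediately before the donor GT,
    so the last `window` nucleotides are the region upstream of the donor.
    Returns a sorted list of (offset, motif) tuples, where offset is the
    0-based start of the match within `pseudoexon_seq`.
    """
    seq = pseudoexon_seq.upper()
    region_start = max(0, len(seq) - window)
    hits = []
    for motif in motifs:
        motif = motif.upper()
        k = len(motif)
        # Exact, overlapping matches restricted to the upstream region.
        for i in range(region_start, len(seq) - k + 1):
            if seq[i:i + k] == motif:
                hits.append((i, motif))
    return sorted(hits)
```

A candidate pseudoexon with several hits in this region could then be down-weighted or flagged for manual review, rather than being accepted on splice site strength alone. Exact string matching is used for simplicity; weighting motifs by relative activity would need additional data not shown here.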