A computational method for predicting guide RNA activity
According to the results above, a considerable fraction of guide RNAs
showed moderate or no activity: only about 50% of guide RNAs in the
library had activity scores > 0.1 (lgscore > -1). This indicates that
guide RNAs designed without a rational method cannot be reliably used
in genome editing. Testing each guide RNA before a gene-editing
experiment is time- and labor-consuming, so an in silico method for
predicting guide RNA efficacy is needed, and several such methods have
recently been developed.[ ] Because we created the first dataset of
Cpf1 guide RNA activity in prokaryotes, a new prediction method based on
our results could complement current studies and test the
generalization ability of methods based on eukaryotic datasets.
We filtered our results by removing low-quality data and established a
dataset based on selective2. We randomly split the dataset (90% for
training and 10% for testing) to guard against overfitting, and used
10-fold cross-validation to re-evaluate the trained model (Fig. 2a). We
then defined a set of features covering the DNA sequence of the
protospacer, the PAM and the base fractions, which converted each
sequence in our library into more than 350 binary and continuous
features used as inputs. Using the features extracted from each guide
RNA, we built a regression model to predict its activity (Fig. 2b). We
re-split the dataset after random shuffling and repeated the training 10
times in total; there was no significant difference in performance among
the trained models, suggesting that our model is robust and that no bias
was introduced by splitting the dataset (Fig. 2c, d).
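The featurization and the shuffled 90/10 split described above can be illustrated with a minimal sketch. It is not the exact feature set used in this study: the function names, the 4-nt PAM and 23-nt protospacer lengths in the usage note, and the choice of single-nucleotide one-hot features plus base fractions are assumptions made for illustration only, and they yield fewer features than the more than 350 used here.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Binary, position-dependent features: one bit per (position, base)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def base_fractions(seq):
    """Continuous features: fraction of each base in the sequence."""
    return [seq.count(b) / len(seq) for b in BASES]

def featurize(pam, protospacer):
    """Concatenate binary and continuous features (illustrative subset)."""
    return one_hot(pam) + one_hot(protospacer) + base_fractions(protospacer)

def shuffled_split(items, test_fraction=0.1, seed=0):
    """Shuffle with a fixed seed, then hold out `test_fraction` for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)
```

For a hypothetical guide with PAM "TTTA" and a 23-nt protospacer, featurize returns 4*4 + 23*4 + 4 = 112 values. Calling shuffled_split with ten different seeds mimics the repeated re-splitting used to check that performance does not depend on one particular partition.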
To evaluate the performance of our model, we compared its predictions
with those of deepCpf1, the most-cited work.[36] We found only a weak
correlation between the experimental results and the predictions from
deepCpf1, whereas our model was much more predictive, with a Spearman
correlation coefficient of 0.80 on average (Fig. 2c, d). This indicates
that models trained on data from mammalian cells provide limited insight
into the guide RNA sequence features contributing to cleavage activity,
possibly because of the impact of chromatin structure and the NHEJ
repair pathway.
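For reference, the Spearman correlation coefficient used above is the Pearson correlation of the ranks of the two variables. The sketch below is a plain-Python illustration with tied values given their average rank; the function names are ours for this example and it is not the code used to produce Fig. 2.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(measured, predicted):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(measured), average_ranks(predicted))
```

Because only ranks enter the calculation, the coefficient rewards a model for ordering guide RNAs correctly even when its scores are on a different scale from the measured activities.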
To verify our finding and explore why the two models differ in
predictive power, we further tested how well they identify the most
efficient guide RNAs as well as the inefficient ones, since selecting
guide RNAs for efficient genome editing is critically needed in both
research and clinical applications. After scoring every sequence in our
library with our model and with deepCpf1, each sequence had a prediction
score to pair with its experimentally measured activity. We then
assessed the ability of each model to distinguish efficient guide RNAs
from inefficient ones by comparing the experimentally measured activity
scores of the sequences with high prediction scores against those of the
sequences with low prediction scores (Fig. 2e). With the scores
predicted by our model, measured activities differed significantly
between the predicted high-score and low-score groups, whereas no
significant difference was found with deepCpf1, confirming that our
model has better predictive capacity. Furthermore, we investigated what
the improvement of our model may be attributed to. This time we divided
the test library into a high-score group and a low-score group according
to experimental activity and compared the prediction scores produced by
each model (Fig. 2f). As expected, our model performed much better than
deepCpf1 in predicting both efficient and inefficient guide RNAs.
Interestingly, deepCpf1 showed almost no ability to predict efficient
guide RNAs and only a weak ability to predict inefficient ones
accurately. This suggests that the failure to characterize high-activity
guide RNAs is a primary reason why deepCpf1 underperformed in
prokaryotes. A possible explanation for why deepCpf1 is insensitive to
data from more efficient guide RNAs is discussed later. Overall, our
model predicts guide RNA activity better than current approaches and is
particularly good at predicting efficient guide RNAs, at least in
prokaryotes, where it was developed.
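The two group comparisons above (grouping by predicted score and comparing measured activity, then grouping by measured activity and comparing predicted score) can be sketched as follows. The text does not state the group size or the statistical test, so the 20% cut-off, the record keys "predicted" and "measured", and the Mann-Whitney U test with a normal approximation (no tie correction) are assumptions chosen purely for illustration.

```python
import math

def split_by_score(records, key, fraction=0.2):
    """Return (high, low): the top and bottom `fraction` of records by `key`.

    The 20% default is a hypothetical cut-off, not a value from the study.
    """
    ranked = sorted(records, key=lambda r: r[key], reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]

def mann_whitney_u(a, b):
    """U statistic for sample a: pairs with a_i > b_j count 1, ties 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def two_sided_p(u, n1, n2):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def compare_groups(records, group_key, value_key, fraction=0.2):
    """Group by `group_key`, then test whether `value_key` differs."""
    high, low = split_by_score(records, group_key, fraction)
    a = [r[value_key] for r in high]
    b = [r[value_key] for r in low]
    u = mann_whitney_u(a, b)
    return u / (len(a) * len(b)), two_sided_p(u, len(a), len(b))
```

compare_groups(records, "predicted", "measured") corresponds to the first comparison and compare_groups(records, "measured", "predicted") to the second. The first returned value is the fraction of high/low pairs ordered as expected (0.5 means no separation), and the second is the approximate p-value, which is only reliable for reasonably large groups.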
We next investigated the sequence features contributing to guide RNA
activity. We focused mainly on the sequence composition of the
protospacer, in addition to other reported factors such as GC content
and melting temperature. A linear model was used to plot the
coefficients of position-dependent dimers and trimers, respectively
(Fig. 3a, b, c). Results of the T7 endonuclease I (T7EI) assay showed
that guide RNA sequence features could also affect genome editing in
human cells (Fig. 3d). Notably, we observed approximately equal effects
of dimers/trimers at each position in the seed region of the
protospacer, considering their distance from the PAM, whereas the first
single nucleotide was previously known as a stronger factor. We also
observed a promoting effect of AH dimers and AHN trimers, and an
inhibitory effect of GB and TK dimers and GBN trimers, at certain
positions. Our findings are consistent with the results of another,
independent activity-profiling screen, although we used a larger-scale
library and provide a more comprehensive and convincing model.[33] These
effects may be attributed to the expression level or stability of the
guide RNA, as well as to the interaction of the Cpf1-crRNA complex with
its DNA substrate.
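The dimer and trimer classes named above use IUPAC degenerate nucleotide codes (H = A, C or T; B = C, G or T; K = G or T; N = any base). The sketch below shows one way to encode position-dependent dimers or trimers as binary features and to locate degenerate motifs in a protospacer. The function names and the 1-based position convention, counted from the first protospacer base as written, are assumptions for illustration and not the implementation behind Fig. 3.

```python
# IUPAC codes needed for the motifs discussed in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "H": "ACT", "B": "CGT", "K": "GT", "N": "ACGT",
}

def matches(pattern, kmer):
    """True if every base of `kmer` is allowed by the IUPAC `pattern`."""
    return len(pattern) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(pattern, kmer)
    )

def motif_positions(protospacer, pattern):
    """1-based start positions at which `pattern` occurs."""
    k = len(pattern)
    return [
        i + 1
        for i in range(len(protospacer) - k + 1)
        if matches(pattern, protospacer[i:i + k])
    ]

def positional_kmer_features(protospacer, k):
    """Binary position-dependent k-mer features, keyed as 'KMER_position'."""
    return {
        f"{protospacer[i:i + k]}_{i + 1}": 1
        for i in range(len(protospacer) - k + 1)
    }
```

For the toy sequence "AAGCTT", motif_positions reports an AH dimer at position 1, a GBN trimer at position 3 and a TK dimer at position 5. In a linear model, each key produced by positional_kmer_features would receive its own coefficient, which is what allows effects to be compared position by position.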