Many sRNA ORFs are translated and expressed as peptides

Our predictions of protein-coding ORFs reflect evidence for conservation of a protein-coding function, which presupposes expression at both the RNA and protein levels. Any annotated sRNA must have evidence for its expression at the RNA level, so all sRNA ORFs have the potential to be translated into peptides. Therefore, as an independent experimental validation of our predictions, we looked for evidence of translation in two types of data: ribosome profiling, which shows the translation of transcripts, and mass spectrometry, which shows the accumulation of protein to detectable levels.

We used ribosome profiling data for B. subtilis 168 and E. coli K12 from \cite{Li:2012dc} to annotate translated sRNA ORFs. Looking for signal accumulating on either the start or stop codon of ORFs not overlapping annotated coding ORFs, we found evidence for translation of 132 out of 397 B. subtilis sRNA ORFs, compared to an average of \(62\pm 11\) expected by chance (based on the translation of mock ORFs in regions not annotated as coding); in E. coli, 54 out of 84 ORFs had evidence for translation, compared to \(42\pm 8.0\) expected by chance (Figure \ref{fig:ribosome}A). Because the ribosome profiling data were strand-specific, we could also test the translation of ORFs on the strand antisense to annotated coding ORFs. In this case, 5 of 57 B. subtilis ORFs were translated, compared to \(1.8\pm 3.0\) expected by chance, and 21 of 45 E. coli ORFs, compared to \(3.9\pm 4.2\) expected by chance. Summed over both classes of ORFs, this corresponds to approximately 73 more unannotated sRNA ORFs translated than expected by chance in B. subtilis and 29 more in E. coli.
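
The excess over chance follows directly from the counts reported above. The short Python sketch below reproduces that arithmetic; the dictionary layout, the `sense`/`antisense` labels and the function name `excess_over_chance` are illustrative choices, not part of the original analysis, and the chance expectations are used as point estimates without propagating their uncertainty.

```python
# Observed number of translated sRNA ORFs and the mean number expected by
# chance, as reported in the text. "sense" here labels ORFs not overlapping
# annotated coding ORFs; "antisense" labels ORFs antisense to annotated
# coding ORFs. These labels are illustrative.
counts = {
    "B. subtilis": {"sense": (132, 62.0), "antisense": (5, 1.8)},
    "E. coli": {"sense": (54, 42.0), "antisense": (21, 3.9)},
}


def excess_over_chance(classes):
    """Sum (observed - expected) over all ORF classes for one species."""
    return sum(observed - expected for observed, expected in classes.values())


excess = {species: excess_over_chance(c) for species, c in counts.items()}

for species, value in excess.items():
    print(f"{species}: {value:.1f} more translated ORFs than expected by chance")
```

Rounded to whole ORFs, the two sums are 73 for B. subtilis and 29 for E. coli.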

Some sRNA ORFs with a protein-coding function may not be included in this set because they are only transcribed and translated under certain conditions; conversely, translation does not necessarily imply function. Nevertheless, there should be significant overlap between these sets, since translation is a prerequisite for protein-encoded function. Indeed, more translated ORFs than non-translated ORFs had high coding scores, meaning that the coding score statistic was able to predict which sRNA ORFs were translated (Figure \ref{fig:ribosome}B). This predictive power was not due to the strength of the Shine-Dalgarno sequence alone, implying that selection for protein-coding function, as reflected in the \(D_{n}/D_{s}\) test and composition bias, was correlated with translation (Supplemental Figure 5). For E. coli K12 MG1655, we predicted only 1-2 sRNA ORFs under selection to maintain protein coding, but there was still significant evidence for translation of sRNA ORFs, and this translation appeared to be predicted by the coding score. Our methods may be unable to predict these translated protein products with statistical confidence even if they are functional, for example because the SD score has little predictive power in E. coli (Supplemental Figure 1), and because translated sRNA ORFs may have different expression profiles and sequence features compared to annotated ORFs.
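
One common way to quantify how well a score separates two groups is the area under the ROC curve (AUC): the probability that a randomly chosen translated ORF has a higher coding score than a randomly chosen non-translated ORF. The sketch below is a minimal illustration of that idea and is not necessarily the analysis behind Figure \ref{fig:ribosome}B; the function name `auc` and the toy scores are hypothetical and are not data from this study.

```python
def auc(translated_scores, untranslated_scores):
    """Probability that a translated ORF outscores a non-translated one.

    Computed by comparing all pairs; ties count as half a win. A value of
    0.5 means no predictive power, 1.0 means perfect separation.
    """
    wins = 0.0
    for t in translated_scores:
        for u in untranslated_scores:
            if t > u:
                wins += 1.0
            elif t == u:
                wins += 0.5
    return wins / (len(translated_scores) * len(untranslated_scores))


# Made-up toy coding scores, for illustration only (not results).
toy_translated = [0.9, 0.7, 0.6, 0.2]
toy_untranslated = [0.5, 0.3, 0.1, 0.1]

toy_auc = auc(toy_translated, toy_untranslated)
print(toy_auc)  # 14 of the 16 pairs favour the translated ORF: 0.875
```

The pairwise form is quadratic in the number of ORFs, which is acceptable for a few hundred sRNA ORFs; a rank-based implementation would be preferable for larger sets.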

Translated ORFs typically had peaks of ribosome profiling signal concentrated at the start or stop codons, as was the case with many full-length annotated ORFs (Figure \ref{fig:ribosome}C). Many sRNAs that were previously identified only by high-throughput screens had evidence for translation, such as the B. subtilis sRNA sbsu2300.1, which was defined based on tiling array data (\cite{rasmussen2009}, Figure \ref{fig:ribosome}C, top). Some sRNAs have multiple ORFs between 10 and 50 amino acids long, making the assignment of ribosome profiling coverage to individual ORFs ambiguous. For example, the CsrC noncoding RNA in E. coli has ORFs overlapping in different frames (Figure \ref{fig:ribosome}C, bottom). In this case, only one ORF under the coverage peak had a coding score greater than 0.5, showing that the coding score can help to prioritize ORFs for follow-up experiments even when evidence for translation is ambiguous.