Background

Recent technological advances such as tiling microarrays and deep RNA sequencing have led to a new appreciation of bacterial transcription, identifying thousands of new bacterial small RNAs (sRNAs) \cite{Waters_2009,Irnov_2010,ng_Cheung_Nong_Huang_Kwan_2012,Wade_2014}. Single strains can contain hundreds of sRNAs, including both independent transcripts and extensive transcription antisense to annotated open reading frames (ORFs), even in bacteria lacking the RNA-binding protein Hfq \cite{Sharma_2010,Rasmussen_2009,Irnov_2010}. In virtually all cases, sRNAs contain no annotated ORFs and it is therefore assumed that their primary function is to act as antisense RNAs modulating the expression of other genes \cite{Storz_2011,Wade_2014}.

However, gene annotations for short ORFs (usually defined as encoding fewer than 50 or 100 amino acids) are notoriously incomplete, and thousands of protein-coding genes remain unannotated in bacteria \cite{Warren_2010}, so many sRNAs could in theory code for functional small proteins. Several dual-function sRNAs are known that have a regulatory role and also encode an experimentally validated functional small protein: E. coli SgrS encodes the protein SgrT \cite{Wadler_2007}, S. aureus RNAIII encodes \(\delta\)-hemolysin \cite{Williams_1947}, B. subtilis SR1 encodes SR1P \cite{Gimpel_2010}, and P. aeruginosa PhrS encodes an unnamed protein \cite{Sonnleitner_2008}. Moreover, because most known sRNAs have no known antisense regulatory function \cite{ng_Cheung_Nong_Huang_Kwan_2012}, their primary function could simply be to encode functional peptides.

Small proteins play important roles in bacteria, including in quorum sensing, transcription, translation, stress response, metabolism, and sporulation \cite{Zuber_2001,Hobbs_2011}. However, they are difficult to identify by either computational or experimental methods. Short sequences offer little room for evidence of natural selection to accumulate, resulting in high levels of statistical noise and false positives, which makes computational discrimination of coding ORFs shorter than about 50 amino acids difficult \cite{Warren_2010,Samayoa_2011}. Standard proteomics methods usually rely on gel electrophoresis, which is biased toward proteins larger than about 30 kDa and precludes detection of very small proteins \cite{Garbis_2005,Hemm_2010}. In addition, proteolytic cleavage of some small proteins yields no peptides of a length detectable by mass spectrometry.

Nevertheless, efforts to identify bacterial short coding sequences have had some success. Proteogenomics, the reannotation of genomes using mass-spectrometry-based proteomics, is a powerful tool for identifying protein-coding genes but still suffers from false negatives, especially for small proteins \cite{Hemm_2010,Tinoco_2011,M_ller_2013}. Most computational methods applied so far have not taken advantage of sRNA annotations, and have either used comparative genomics information exclusively \cite{Warren_2010,Washietl_2011} or been applied only to a single species \cite{Ibrahim_2007,Hemm_2008,Samayoa_2011}. No existing method is ideal for determining the overall number of coding ORFs in sRNAs. Some comparative genomics methods take into account more information than the \(D_{n}/D_{s}\) test (the ratio of nonsynonymous to synonymous substitution rates), but the added complexity can make algorithms more brittle. PhyloCSF \cite{Lin_Jungreis_Kellis_2011}, for example, must fit a larger number of parameters, which can be problematic for small bacterial genomes, and it remains untested on prokaryotes. RNAcode \cite{Hofacker_Stadler_Goldman_2011} handles multiple-alignment issues such as insertions and deletions intelligently, but because it does not take phylogenetic structure into account, it relies on careful selection of orthologous species to yield relevant results, which makes it difficult to apply on a large scale. Warren et al. \cite{Warren_2010} used a clever BLAST-based approach to find new genes quickly, but this approach is less sensitive than \(D_{n}/D_{s}\), which is aware of phylogeny and of mutations at the DNA level. Other methods are ad hoc and difficult to apply to other species \cite{Hemm_2008} or fail to combine sequence features with comparative genomics \cite{Ibrahim_2007}.

Short proteins can rarely be predicted with nearly 100\% confidence because the evidence is limited, yet most standard gene annotation tools do not provide an estimated false discovery rate (FDR) for marginal predictions and instead apply ad hoc cutoffs on amino acid length or coding score. Even without confident individual predictions, however, statistically sound conclusions can be drawn when short ORFs are considered in aggregate; for example, the overall number of ORFs under natural selection to maintain protein-coding potential can be estimated.
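
As a rough illustration of how such an aggregate estimate can work, consider a standard null-proportion estimator of the kind used in FDR analysis. This is a sketch under stated assumptions and not necessarily the estimator used in this work; the symbols \(m\), \(p_i\), \(\lambda\), \(\hat{m}_0\), and \(\hat{m}_1\) are introduced here only for illustration. Suppose each of \(m\) short ORFs receives a p-value \(p_i\) against the null hypothesis that it is not under selection for coding potential. If null p-values are approximately uniform on \([0,1]\), then for a threshold \(\lambda\) above which few true coding ORFs are expected,
\[
\hat{m}_0(\lambda) = \frac{\#\{\, i : p_i > \lambda \,\}}{1-\lambda},
\qquad
\hat{m}_1 = m - \hat{m}_0(\lambda),
\]
where \(\hat{m}_0\) estimates the number of ORFs not under selection and \(\hat{m}_1\) the number that are. With hypothetical values \(m = 1000\), \(\lambda = 0.5\), and 400 ORFs having \(p_i > 0.5\), this gives \(\hat{m}_0 = 400/0.5 = 800\) and \(\hat{m}_1 = 200\), even though no individual ORF among those 200 need be confidently identified.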

To identify short coding sequences in diverse species with high fidelity, algorithms must adapt to composition biases such as GC content, the strength and frequency of Shine-Dalgarno sequence motifs, the availability of closely related genomes, and the structure of the phylogenetic tree relating these species. We set out to reexamine the assumption that most sRNAs are noncoding by applying simple and adaptable computational and statistical methods to a broad range of bacterial species, paying special attention to controlling for several biases in sRNA ORF sequence properties. We developed a computational method to predict coding ORFs called Discovery of sRNA Coding ORFs in Bacteria (DiSCO-Bac). We then validated the translation of predicted coding sRNA ORFs with experimental data from ribosome profiling and mass spectrometry. We also mined experimental data from various sources to show that many of the resulting small proteins are functional, and that a surprising number may be encoded antisense to other RNAs, many of which represent toxin-antitoxin systems.