# Robin C. Friedman$${}^{a,b,c,h}$$, Stefan Kalkhof$${}^{d,e}$$, Olivia Doppelt-Azeroual$${}^{c}$$, Stephan Mueller$${}^{d}$$, Martina Chovancová$${}^{d}$$, Martin von Bergen$${}^{d,f,g}$$, Benno Schwikowski$${}^{a,c}$$

$${}^{a}$$ Systems Biology Laboratory, Department of Genomes and Genetics, Institut Pasteur, Paris, France

$${}^{b}$$ Molecular Microbial Pathogenesis Unit, Department of Cell Biology and Infection, Institut Pasteur, Paris, France

$${}^{c}$$ Center of Bioinformatics, Biostatistics and Integrative Biology, Institut Pasteur, Paris, France

$${}^{d}$$ Department of Proteomics, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany

$${}^{e}$$ Current Address: Department of Bioanalytics, University of Applied Sciences and Arts of Coburg, Coburg, Germany

$${}^{f}$$ Department of Metabolomics, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany

$${}^{g}$$ Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark

$${}^{h}$$ Corresponding author (robin.friedman@gmail.com)

Abstract

Background:

While eukaryotic noncoding RNAs have recently received intense scrutiny, it is becoming clear that bacterial transcription is at least as pervasive. Bacterial small RNAs and antisense RNAs (sRNAs) are often assumed to be noncoding because they lack long open reading frames (ORFs). However, there are numerous examples of sRNAs that encode small proteins, whether or not they also have a regulatory role at the RNA level.

Results:

Here, we apply flexible machine learning techniques based on sequence features and comparative genomics to quantify the prevalence of sRNA ORFs under natural selection to maintain protein-coding function in 14 phylogenetically diverse bacteria. A majority of annotated sRNAs have at least one ORF between 10 and 50 amino acids long, and we conservatively predict that $$188\pm 25.5$$ unannotated sRNA ORFs are under selection to maintain coding, an average of 13 per species considered here. This implies that overall at least $$7.5\pm 0.3\%$$ of sRNAs have a coding ORF, and in some species at least 20% do. Of these novel coding ORFs, $$84\pm 9.8$$ have some antisense overlap with annotated ORFs. As experimental validation, many of our predicted ORFs show evidence of translation in ribosome profiling data and are identified by shotgun mass spectrometry proteomics. B. subtilis sRNAs with coding ORFs are enriched for high expression in biofilms and confluent growth, and S. pneumoniae sRNAs with coding ORFs are involved in virulence. sRNA coding ORFs are enriched for transmembrane domains, and many are novel components of type I toxin/antitoxin systems.

Conclusions:

We predict over a dozen new protein-coding genes per bacterial species and, crucially, also quantify the uncertainty in this estimate. Our predictions for sRNA coding ORFs, along with novel type I toxins and tools for sorting and visualizing genomic context, are freely available in a user-friendly format at http://disco-bac.web.pasteur.fr. We expect these easily accessible predictions to be a valuable tool for the study not only of bacterial sRNAs and type I toxin-antitoxin systems, but also of bacterial genetics and genomics.

Keywords: sRNAs, type I toxin/antitoxin, short ORFs, machine learning, ribosome profiling, mass spectrometry

## Background

Recent technological advances such as tiling microarrays and deep RNA sequencing have led to a new appreciation of bacterial transcription, identifying thousands of new bacterial small RNAs (sRNAs) (Waters 2009, Irnov 2010, Li 2012, Wade 2014). Single strains can contain hundreds of sRNAs, including both independent transcripts and extensive transcription antisense to annotated open reading frames (ORFs), even in bacteria lacking the RNA-binding protein Hfq (Sharma 2010, Rasmussen 2009, Irnov 2010). In virtually all cases, sRNAs contain no annotated ORFs and it is therefore assumed that their primary function is to act as antisense RNAs modulating the expression of other genes (Storz 2011, Wade 2014).

However, gene annotations for short ORFs (usually defined as shorter than 100 or 50 amino acids) are notoriously incomplete and thousands of protein-coding genes remain unannotated in bacteria (Warren 2010), so many sRNAs could in theory code for functional small proteins. There are several examples of dual-function sRNAs that have a regulatory role and also code for an experimentally validated functional small protein: E. coli SgrS encodes the protein SgrT (Wadler 2007), S. aureus RNAIII encodes $$\delta$$-hemolysin (Williams 1947), B. subtilis SR1 encodes SR1P (Gimpel 2010), and P. aeruginosa PhrS encodes an unnamed protein (Sonnleitner 2008). Moreover, because most known sRNAs have no known antisense regulatory function (Li 2012), their primary function could simply be to code for functional peptides.
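To make the notion of a candidate short ORF concrete, the following minimal sketch enumerates forward-strand ORFs of a given length in an sRNA sequence. It is an illustration only, not the procedure used in this work: the function name `find_short_orfs` is hypothetical, the start codon set (ATG, GTG, TTG) is an assumption about common bacterial starts, and the default bounds of 10 to 50 amino acids mirror the range quoted in the abstract.

```python
# Hypothetical helper: list candidate short ORFs on the forward strand.
# Assumed start codons (common in bacteria) and the three standard stops.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_short_orfs(seq, min_aa=10, max_aa=50):
    """Return (start, end, n_aa) tuples for ORFs in all three forward frames.

    start is the 0-based index of the start codon, end is the index just
    past the stop codon, and n_aa counts codons from the start codon up to
    (not including) the stop codon. Only the most upstream start before
    each stop is reported.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if min_aa <= n_aa <= max_aa:
                    orfs.append((start, i + 3, n_aa))
                start = None  # look for the next ORF in this frame
    return orfs


example = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
print(find_short_orfs(example))
```

Scanning the antisense strand would require running the same function on the reverse complement. Such a scan only lists candidates; it says nothing about whether an ORF is actually under selection to maintain coding.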

Small proteins play important roles in bacteria, including quorum sensing, transcription, translation, stress response, metabolism, and sporulation (Zuber 2001, Hobbs 2011). However, they are difficult to identify by computational or experimental methods. Short sequences offer less room for evidence of natural selection to accumulate, resulting in high levels of statistical noise and false positives and making computational discrimination of coding ORFs smaller than about 50 amino acids difficult (Warren 2010, Samayoa 2011). Standard proteomics methods usually utilize gel electrophoresis, which biases towards proteins larger than about 30 kDa and precludes detection of very small proteins (Garbis 2005, Hemm 2010). Proteolytic cleavage of some small proteins also yields no peptides of a length detectable by mass spectrometers.

Nevertheless, efforts to identify bacterial short coding sequences have had some success. Proteogenomics, the reannotation of genomes using mass-spectrometry-based proteomics, is a powerful tool for identifying protein-coding genes but still suffers from false negatives, especially for small proteins (Hemm 2010, Tinoco 2011, Müller 2013). Most computational methods applied so far have not taken advantage of sRNA annotations and either used comparative genomics information exclusively (Warren 2010, Washietl 2011) or were applied only to a single species (Ibrahim 2007, Hemm 2008, Samayoa 2011). No existing method is ideal for determining the overall number of sRNA coding ORFs. Some comparative genomics methods take into account more information than the $$D_{n}/D_{s}$$ test, but more complexity can make algorithms more brittle. For PhyloCSF (Lin 2011), the greater number of parameters to fit can be problematic for small bacterial genomes, and this method remains untested on prokaryotes. RNAcode (Washietl 2011) handles multiple alignment issues like insertions and deletions intelligently, but because it does not take into account phylogenetic structure it relies on careful selection of orthologous species to yield relevant results, making it difficult to apply on a large scale. Warren et al. (Warren 2010) used a clever BLAST-based approach to quickly find new genes, but this is less sensitive than $$D_{n}/D_{s}$$, which is aware of phylogeny and mutations at the DNA level. Other methods are either ad hoc and difficult to apply to other species (Hemm 2008) or do not incorporate both sequence features and comparative genomics (Ibrahim 2007).
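The signal that a $$D_{n}/D_{s}$$-style test builds on can be illustrated by tallying synonymous and nonsynonymous codon differences between two aligned, in-frame sequences. The sketch below is a deliberate simplification and not the test used in this work: it only counts codons that differ at exactly one position, it does not normalize by the number of synonymous and nonsynonymous sites, and it ignores phylogeny. The function name `count_codon_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a, seq_b):
    """Count (synonymous, nonsynonymous) single-nucleotide codon differences.

    Both inputs must be gap-free, aligned, in-frame DNA of equal length.
    Codons differing at more than one position, or containing characters
    outside TCAG, are skipped (a simplification for illustration).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(ca, cb)) != 1:
            continue  # multi-hit codons need pathway averaging; skipped here
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


print(count_codon_changes("ATGGCTGAA", "ATGGTTGAG"))
```

An excess of synonymous over nonsynonymous changes, relative to what is expected by chance, is the kind of evidence for selection on the encoded protein that such tests formalize. For an ORF of only a few dozen codons the counts are small, which is why the evidence for any single short ORF tends to be weak.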

Because the evidence available for any single short ORF is limited, short proteins can rarely be predicted with near-certainty. Nevertheless, most standard gene annotation tools do not provide an estimated false discovery rate (FDR) for marginal predictions, instead choosing ad hoc cutoffs for amino acid length or coding score. Even without confident individual predictions, statistically sound conclusions can be made when considering short ORFs in aggregate; for example, the overall number of ORFs under natural selection to maintain protein-coding potential can be estimated.
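One generic way to see how an aggregate count can be estimated without confident individual calls is the null-proportion estimator commonly attributed to Storey. The sketch below is an illustration under stated assumptions, not necessarily the estimator used in this work: it assumes each ORF has a p-value against a "not under coding selection" null, that null p-values are uniform on [0, 1], and that few truly coding ORFs have p-values above the tuning threshold `lam`. The function name `estimate_num_coding` is hypothetical.

```python
def estimate_num_coding(p_values, lam=0.5):
    """Estimate how many tests are non-null from the p-value distribution.

    Under the null, p-values are uniform, so the count above `lam` divided
    by (1 - lam) estimates the number of true nulls. Subtracting that from
    the total gives an estimate of the number of non-null (coding) ORFs.
    """
    m = len(p_values)
    if m == 0:
        return 0.0
    n_above = sum(1 for p in p_values if p > lam)
    pi0 = min(1.0, n_above / (m * (1.0 - lam)))  # estimated null fraction
    return m * (1.0 - pi0)


# Toy input: 20 strongly significant ORFs plus 80 evenly spread null p-values.
toy = [0.001] * 20 + [(i + 0.5) / 80 for i in range(80)]
print(round(estimate_num_coding(toy), 6))
```

The estimate is conservative in the sense that any coding ORFs with p-values above `lam` are counted as nulls, so it tends to give a lower bound on the number of coding ORFs. An uncertainty on the estimate could be obtained by resampling the ORFs, for example with a bootstrap.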

To identify short coding sequences in diverse species with high fidelity, algorithms must adapt to composition biases such as GC content, the strength and frequency of Shine-Dalgarno sequence motifs, the availability of closely related genomes, and the structure of the phylogenetic tree relating these species. We set out to reexamine the assumption that most sRNAs are noncoding by applying simple and adaptable computational and statistical methods to a broad range of bacterial species, paying special attention to controlling for several biases in sRNA ORF sequence properties. We developed a computational method to predict coding ORFs called Discovery of sRNA Coding ORFs in Bacteria (DiSCO-Bac). We then validated the translation of predicted coding sRNA ORFs with experimental data from ribosome profiling and mass spectrometry. We also mined experimental data from various sources to show that many of the resulting small proteins are functional and that a surprising number may be encoded antisense to other RNAs, many of which represent toxin-antitoxin systems.