# Robin C. Friedman$${}^{a,b,c,h}$$, Stefan Kalkhof$${}^{d,e}$$, Olivia Doppelt-Azeroual$${}^{c}$$, Stephan Mueller$${}^{d}$$, Martina Chovancová$${}^{d}$$, Martin von Bergen$${}^{d,f,g}$$, Benno Schwikowski$${}^{a,c}$$

$${}^{a}$$ Systems Biology Laboratory, Department of Genomes and Genetics, Institut Pasteur, Paris, France

$${}^{b}$$ Molecular Microbial Pathogenesis Unit, Department of Cell Biology and Infection, Institut Pasteur, Paris, France

$${}^{c}$$ Center of Bioinformatics, Biostatistics and Integrative Biology, Institut Pasteur, Paris, France

$${}^{d}$$ Department of Proteomics, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany

$${}^{e}$$ Current Address: Department of Bioanalytics, University of Applied Sciences and Arts of Coburg, Coburg, Germany

$${}^{f}$$ Department of Metabolomics, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany

$${}^{g}$$ Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark

$${}^{h}$$ Corresponding author (robin.friedman@gmail.com)

Abstract

Background:

While eukaryotic noncoding RNAs have recently received intense scrutiny, it is becoming clear that bacterial transcription is at least as pervasive. Bacterial small RNAs and antisense RNAs (sRNAs) are often assumed to be noncoding because they lack long open reading frames (ORFs). However, there are numerous examples of sRNAs that encode small proteins, whether or not they also have a regulatory role at the RNA level.

Results:

Here, we apply flexible machine learning techniques based on sequence features and comparative genomics to quantify the prevalence of sRNA ORFs under natural selection to maintain protein-coding function in 14 phylogenetically diverse bacteria. A majority of annotated sRNAs have at least one ORF between 10 and 50 amino acids long, and we conservatively predict that $$188\pm 25.5$$ unannotated sRNA ORFs are under selection to maintain coding function, an average of 13 per species considered here. This implies that overall at least $$7.5\pm 0.3\%$$ of sRNAs have a coding ORF, and in some species at least 20% do. Of these novel coding ORFs, $$84\pm 9.8$$ have some antisense overlap with annotated ORFs. As experimental validation, many of our predicted ORFs show evidence of translation in ribosome profiling data and are identified by mass spectrometry shotgun proteomics. B. subtilis sRNAs with coding ORFs are enriched for high expression in biofilms and confluent growth, and S. pneumoniae sRNAs with coding ORFs are involved in virulence. Coding ORFs in sRNAs are enriched for transmembrane domains, and many are novel components of type I toxin/antitoxin systems.

Conclusions:

We predict over a dozen new protein-coding genes per bacterial species and, crucially, also quantify the uncertainty in this estimate. Our predictions for sRNA coding ORFs, along with novel type I toxins and tools for sorting and visualizing genomic context, are freely available in a user-friendly format at http://disco-bac.web.pasteur.fr. We expect these easily accessible predictions to be a valuable tool for the study not only of bacterial sRNAs and type I toxin-antitoxin systems, but also of bacterial genetics and genomics.

Keywords: sRNAs, type I toxin/antitoxin, short ORFs, machine learning, ribosome profiling, mass spectrometry

## Background

Recent technological advances such as tiling microarrays and deep RNA sequencing have led to a new appreciation of bacterial transcription, identifying thousands of new bacterial small RNAs (sRNAs) (Waters 2009, Irnov 2010, Li 2012, Wade 2014). Single strains can contain hundreds of sRNAs, including both independent transcripts and extensive transcription antisense to annotated open reading frames (ORFs), even in bacteria lacking the RNA-binding protein Hfq (Sharma 2010, Rasmussen 2009, Irnov 2010). In virtually all cases, sRNAs contain no annotated ORFs and it is therefore assumed that their primary function is to act as antisense RNAs modulating the expression of other genes (Storz 2011, Wade 2014).

However, gene annotations for short ORFs (usually defined as shorter than 100 or 50 amino acids) are notoriously incomplete, and thousands of protein-coding genes remain unannotated in bacteria (Warren 2010), so many sRNAs could in theory code for functional small proteins. There are several examples of dual-function sRNAs that have a regulatory role and also code for an experimentally validated functional small protein: E. coli SgrS encodes the protein SgrT (Wadler 2007), S. aureus RNAIII encodes $$\delta$$-hemolysin (Williams 1947), B. subtilis SR1 encodes SR1P (Gimpel 2010), and P. aeruginosa PhrS encodes an unnamed protein (Sonnleitner 2008). However, bec