Introduction

Membrane proteins account for 20--30\% of the genome of any given organism \cite{Stevens_2000}. The chemical and spatial characteristics of their environment neatly separate this class from the rest of the proteome: in the lipid bilayer, proteins acquire geometric constraints, distinct functions and dynamics \cite{Cournia_2015}, and a predisposition for internal and quaternary symmetries [ref]. Moreover, it has been estimated that membrane proteins are targeted by around half of all FDA-approved drugs \cite{Davey_2004} and by a similar proportion of physiologically relevant small ligands \cite{Bull_2015}, making them extremely relevant in both cell biology and pharmaceutics.

However, the same environment that gives membrane proteins their distinctive features renders them particularly elusive to current methodologies for structure determination. Only recently have high-throughput crystallization techniques become available for integral proteins, and many technical enhancements are still ongoing \cite{Moraes_2014}. As of January 2017, membrane proteins of known structure account for less than 2\% of the Protein Data Bank (PDB) \cite{Berman_2000}, of which 671 are unique proteins (http://blanco.biomol.uci.edu/mpstruc/ \cite{White_2009}). The number of newly resolved membrane proteins is increasing at about the same pace as that of the entire proteome, presenting a considerable challenge for studies relying on an up-to-date set of available structural data, such as modeling, protein evolution, and structure quality assessment.
For example, the need for better quality and more informative classifications is driving the design of new tools specific to membrane protein sequences, structures and functions \cite{Stansfeld_2014}\cite{Stansfeld_2012}.

Databases dedicated solely to structures of membrane proteins are available, two of the best known being PDBTM \cite{Tusnady_2004} and OPM \cite{Lomize_2006}. PDBTM is a collection of membrane protein structures compiled weekly, based on an automated analysis of new structures in the PDB. OPM is thought to be more accurate in reporting the orientation of the protein in the lipid bilayer \cite{Lomize_2006a} (Rath 2013; Lomize 2006), yet its manual supervision considerably reduces the frequency of updates and sometimes hinders automatic parsing. In addition to coordinate files, both databases report very useful and often complementary information about membrane-spanning segments, orientation, and placement in well-known whole-proteome classifications such as SCOP \cite{Hubbard_1997}, TCDB \cite{Saier_2006} and OCA \cite{OCA}. More recently, MemProtMD \cite{Stansfeld_2014} has positioned each membrane protein of known structure in an explicit (yet coarse-grained) model lipid bilayer. However, none of these databases provides original classifications or relations between the proteins that it enumerates. Moreover, within the context of the whole-proteome classifications (SCOP, CATH and DALI \cite{Holm_1995}), the classifications of membrane proteins have been found to be contradictory \cite{Frishman_2010}.
Classifications specific to membrane proteins, or subsets thereof, are found exclusively at the genome or sequence level \cite{Golmohammadi_2007}. Despite the clear advantages that a classification at the structural level would provide, the limited number of available structures has so far hindered any automated, large-scale approach to this problem.

During the last decade, doubts over the actual feasibility of an accurate, rigid structural classification have spread significantly \cite{Valas_2009}\cite{Bourne_2005}, as it has become clearer that protein structures cannot be organized into a set of discrete classes, but are rather distributed continuously over structure space \cite{Skolnick_2009}. This fact weakens the key assumption behind top-down approaches (such as SCOP \cite{Hubbard_1997} and CATH \cite{Orengo_1997}), namely that well-separated structural classes can be defined a priori, and at the same time drastically increases the risk that any bottom-up methodology (such as Dali \cite{Holm_1995}) depends too strongly on the fine-tuning of its parameters. As Petrey and Honig suggest, more dynamic approaches that integrate several different kinds of annotations and use residue-level estimators are needed in order to increase the accuracy and robustness of future databases \cite{Petrey_2009}.

We present here the Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry (EncoMPASS), a fully automatic database with which we aim to introduce a new and more flexible way of portraying the structural relationships among experimentally resolved membrane proteins.
Our objective is to provide a full structural context for any resolved membrane protein by highlighting sequence and structural similarities and by identifying the complete set of its internal and external symmetries. The EncoMPASS library takes information from the OPM database and the PDB annotations, and organizes it into a consistent set of membrane protein structures. For each protein chain we make available a list of similar structures and a set of similarity measures: sequence-based and structure-based sequence identity, from MUSCLE \cite{Edgar} and Fr-TM-Align \cite{Pandit_2008} alignments respectively, TM-score \cite{Zhang_2004}, RMSD and novel residue-level RMSD-based estimators. All proteins have also been analyzed with CE-Symm \cite{Myers_Turnbull_2014} and SymD \cite{Kim_2010} in order to find internal and external symmetries. In addition to the information provided by the two algorithms, further analyses that complete and harmonize their sometimes differing results will be made available.

EncoMPASS does not contain any kind of rigid classification (such as the ones used by SCOP or CATH). Thus, while it is an accurately annotated collection of protein structures, it is not, and does not intend to be, a collection of structural elements, be they domains, folds, bundles or fragments. Indeed, although the several similarity measures we calculate could suggest the development of a global classification of membrane proteins, we are not convinced of the feasibility, accuracy or necessity of this type of classification, even at the proteome-wide scale. We take our cue from these considerations and focus on providing a set of tools that help users find the type of connections they are looking for, without imposing a priori classifications.

In the following sections we will first present and justify the methodologies we used to produce the database.
In the Algorithm section we will explain the typical workflow of the EncoMPASS building program, detailing the main steps that a PDB entry undergoes when it is added to the database. In the Implementation section we will instead describe the features and analyses that users can exploit when browsing the online database.

Methods

Chains as fundamental structural units

Over the past decades, the bias toward the study of globular proteins has helped to promote the importance of structural domains in the evolution and organization of a protein. Domains have in fact been described as carrying structural significance as compact units \cite{Richardson_1981} and as the main actors of the folding process \cite{Wetlaufer_1973}. Moreover, they have been connected with functional specificity \cite{Bork_1991}. The key concept underlying all three aspects is the alleged structural, and often functional and evolutionary, independence of protein domains, which might lead one to consider them as the true building blocks of a protein \cite{Janin_1983}. This view is still embraced by many popular studies and resources, of which SCOP is a notable example \cite{Andreeva_2007}. Nonetheless, the relevance of domains has recently been reassessed: alternative fundamental structures have been proposed \cite{Berezovsky_2016}, while exceptions to the structural independence of protein domains have emerged \cite{Wu_2008}\cite{Espada_2015}. Coevolution models are helping to define small protein fragments as another evolutionary unit \cite{Dib_2012}. Moreover, the membrane proteome was never fully reconciled with the protein domain picture, owing to the lack of multi-domain membrane proteins and their relative disinclination to domain recombination \cite{Liu_2004}. For these reasons, EncoMPASS assesses structural relationships and detects symmetries using protein chains as the fundamental units. We only consider chains with at least one transmembrane region.
The resolution of the PDB entry must be better than 3.5 \AA, and the chain must not contain undefined residues and must not be missing large structural portions (more than 100 residues). We also exclude models that were not determined by X-ray crystallography. At the moment, our database contains 6666 chains extracted from 2105 PDB entries.

Building structural relations among the chain units

The simplicity and unambiguity of the definition of the chain units is offset by their number, size and flexibility, which make structural comparison a complex task. We did not attempt an all-vs-all structural comparison for two reasons. First, the number of chain structures is very high, and comparing all of them would place an onerous load on our machines. Second, even if we managed to perform all alignments, a large part of them would be meaningless and could convey misleading information. Hence, we only compare structures that have the same number of transmembrane regions. We are aware that this constitutes a limitation of the method, which we plan to overcome by adding other criteria for the allowed alignments. Nonetheless, we made this choice in order to keep the number of falsely related chains in our database as low as possible.

To detect the transmembrane regions, we start from the data produced by OPM (or by its membrane insertion routine). OPM defines as "segments" the secondary structure elements that are totally or partially inserted in the membrane. For our purposes, we define a transmembrane region as a set of contiguous residues whose \(C_\alpha\) atoms are immersed at least 1 \AA{} inside the bilayer. Such a subchain must also contain at least one of the segments defined by OPM (i.e., it cannot be completely unstructured, and cannot be too short). Because the structures to be aligned are highly flexible, it is essential to choose an alignment method that can cope with this flexibility.
We opted for Fr-TM-Align \cite{Pandit_2008}, which combines a careful fragment-based assessment of the portions of the two chains to be superimposed with a thorough optimization of the rotation that maximizes the TM-score scoring function:

\(\text{TM-score} = \frac{1}{L}\left[\sum_{i=1}^{L_{ali}}\frac{1}{1+d_i^2/d_0^2}\right]\)

where

\(d_0 = 1.24\,\sqrt[3]{L - 15} - 1.8\)

Here, \(L\) is the length of the chain, \(L_{ali}\) is the number of aligned residue pairs, and \(d_i\) is the distance between the \(i\)-th pair of equivalent \(C_\alpha\) atoms of the two structures. This metric has been found to be more accurate than RMSD, especially in cases where the resemblance between the two structures is partial. It is independent of the size of the protein and has been found to correlate with topology information. Specifically, structures with a TM-score > 0.5 are likely to share the same fold, and structures with a TM-score < 0.5 are likely to have different ones \cite{Xu_2010}. Previous studies highlighted that a threshold of 0.6 is more sensitive to structural differences in membrane proteins (probably because of their greater overall structural resemblance). Thus, for all analyses contained in this work we will consider as structurally akin any two structures with TM-score > 0.6 \cite{Stamm_2015}.

The sequence identity is defined as the number of matching residues normalized by the total number of aligned residues (matches and mismatches, but not gaps). This is one of the many possible ways to determine sequence identity, for which no agreed-upon definition exists. We chose it because it is coherent with our intention to focus on structure rather than on sequence: by not counting the unaligned parts, we do not take into consideration the effective length of the chains, but normalize only over the parts that the alignment programs have actually aligned. Again following previous studies, we define two protein chains as similar in sequence if their sequence identity is greater than 0.85.
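The two estimators defined above can be illustrated with a minimal Python sketch. It is not the EncoMPASS implementation: the function names (`tm_d0`, `tm_score`, `sequence_identity`) are hypothetical, the distances \(d_i\) are assumed to come from an already superimposed pair of structures (Fr-TM-Align performs the actual optimization), and the aligned sequences are assumed to be equal-length strings with `-` as the gap character.

```python
def tm_d0(length):
    """Distance scale d0 (in angstroms) for a chain of `length` residues."""
    if length <= 15:
        raise ValueError("d0 is undefined for chains of 15 residues or fewer")
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("chain too short for a positive d0")
    return d0


def tm_score(distances, length):
    """TM-score for a fixed superposition.

    distances: C-alpha distances d_i of the aligned residue pairs (angstroms).
    length:    chain length L used for normalization.
    """
    d0 = tm_d0(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Matches divided by aligned (non-gap) positions, as defined in the text."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the normalization
        aligned += 1
        if a == b:
            matches += 1
    return matches / aligned if aligned else 0.0
```

With these definitions, a pair of chains would be flagged as structurally akin when `tm_score(...) > 0.6` and as similar in sequence when `sequence_identity(...) > 0.85`.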
This threshold selects only very closely related proteins (often the same protein in different species). This agrees with our aim of not using sequence-derived information to assess relationships between proteins, while at the same time helping us to highlight the degree of similarity between different conformations of the same protein.

Symmetry detection in membrane proteins

How symmetry detection works in general (through the examples of CE-Symm and SymD). Why membrane proteins are special in this respect (restrictions due to the lipid bilayer; proteins are in motion and we want to track differences in symmetry between conformations). How CE-Symm and SymD behave with membrane proteins. Which other tools we provide.

[Figure 1: Flowchart, referring to Algorithm section (not to clutter the figures in the Implementation part)]

Algorithm

EncoMPASS is created and updated automatically through a series of routines exploiting a library of Python 3.5 functions freely available at (GIT address). The bundle does not contain the external programs for finding the correct orientation of the protein in the membrane (PPM), for sequence and structure alignment (MUSCLE and Fr-TM-Align, respectively) and for symmetry detection (CE-Symm and SymD), which can be obtained freely. Highly parallel routines are run on the LoBoS cluster hosted at NIH (website).

In order to better illustrate the workflow of the algorithm, we explain step by step the process of adding a new structure to the database, which is also depicted in Figure 1.

Structure acquisition, selection and homogenization

The main list of resolved structures published on PDBTM is scanned in search of unclassified entries. If a new one is found, the information about the structure is retrieved from OPM and the PDB.
Specifically, the coordinate file is downloaded from both databases, whereas the information about membrane insertion and TM segments is taken from OPM, and the FASTA sequence of all chains is taken from the PDB.

A number of preliminary checks are carried out in order to test the consistency of the downloaded resources. During this operation, the filters described in Methods are applied: the resolution cutoff, the exclusion of all structures not resolved by X-ray crystallography, and the exclusion of entries not meeting basic structural requirements. During the initial run, a total of XXX structures out of YYY have been discarded. After passing the filters, all downloaded files undergo strict consistency checks: if the files downloaded from the PDB are found to be in disagreement with each other, the structure is excluded. If the inconsistencies are between PDB and OPM files, the PDB files are always preferred, and the entry is added to the run-list for PPM, the program that generates OPM data. The coordinate file is made to represent only one biological molecule, according to the first BIOMOLECULE model listed in REMARK 300 and REMARK 350 of the PDB file. When all the new entries have been processed, those which have to be positioned in the membrane undergo a PPM run on the LoBoS cluster.

Structure and sequence alignments, and their statistical estimators

When the orientation of all structures in the lipid bilayer has been determined, the number of transmembrane regions in each chain is calculated. Chains are grouped by secondary structure class (alpha or beta) and by the number of transmembrane regions they contain. Table 1 provides the number of structures present in each topological category. A large part of the chains that have been resolved so far contain a very limited number of transmembrane regions. This is to be expected, since proteins with large soluble parts are often easier to crystallize.
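The partition of the chains into topological categories, and the restriction of the comparisons to chains of the same category, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the EncoMPASS code: the function names (`group_chains`, `pairs_to_align`) and the input layout (tuples of chain identifier, secondary structure class, and number of transmembrane regions) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_chains(chains):
    """Group chain identifiers by (secondary structure class, number of TM regions).

    chains: iterable of (chain_id, ss_class, n_tm) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for chain_id, ss_class, n_tm in chains:
        groups[(ss_class, n_tm)].append(chain_id)
    return dict(groups)


def pairs_to_align(groups):
    """Yield every unordered pair of chains that share a topological category."""
    for key in sorted(groups):
        for chain_a, chain_b in combinations(sorted(groups[key]), 2):
            yield key, chain_a, chain_b


if __name__ == "__main__":
    # Placeholder identifiers and TM counts, for illustration only.
    example = [
        ("aaaa_A", "alpha", 6),
        ("bbbb_A", "alpha", 6),
        ("cccc_A", "beta", 8),
        ("dddd_A", "alpha", 6),
        ("eeee_A", "alpha", 1),
    ]
    for key, chain_a, chain_b in pairs_to_align(group_chains(example)):
        print(key, chain_a, chain_b)
```

Restricting the pairs in this way reduces the number of alignments with respect to a full all-vs-all comparison: in the placeholder example, only the three pairs within the six-region alpha category are generated, instead of the ten pairs among all five chains.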
In each of the resulting sets, all-vs-all sequence and structure alignments are performed on the LoBoS cluster with MUSCLE and Fr-TM-Align, respectively. The alignments are then collected, and a series of statistics is extracted from each of them: the sequence identity based on the sequence alignment, the sequence identity based on the structure alignment, the TM-score and the RMSD (both based on the structure alignment). The distance between each pair of aligned \(C_\alpha\) atoms is also recorded, and the raw output files of the programs are stored.

Assessment of internal and external symmetries

Implementation

The EncoMPASS web database is publicly hosted at (URL). The main page contains a brief introduction and general statistics about the number of classified protein chains (here summarized in Figure 5). From the main page, the user can navigate to any entry by typing the corresponding PDB code in the search bar. PDB codes with a chain name specification are also allowed (e.g. "1okc\_A").

Figure 2a shows a typical page describing a whole protein. Information about the model is divided into 5 sections. The first one contains a general overview: information from the PDB is reported, and a .gif animation is included as a visual reference. The coordinate file, often modified during the homogenization procedure described in Algorithm, is available for download, as is a PyMOL script to reproduce the .gif file. The second section is dedicated to the structural analyses, and contains links to the pages dedicated to the single chains, together with a table reporting their secondary structure type, number of transmembrane regions and number of sequence, structure and total (structure and sequence) neighbors, as defined in Methods.

The other three sections collect symmetry data and analyses: the first one presents all data computed by CE-Symm and SymD, XXXCOMPLETEHEREXXX.

Discussion

EncoMPASS is an online database for the analysis of the structure and symmetry of membrane proteins.
Presently, it contains 2105 PDB entries and 6666 single-chain structures. Although the entries are not organized in a rigid classification, several similarity measures are provided. Any two chains containing the same number of transmembrane regions were compared by sequence and structure alignment. The similarity was assessed by sequence identity (on both alignments), TM-score and RMSD.

Our purpose with EncoMPASS is thus to offer a consistent set of similarity estimators which can be explored and combined according to the user's needs. The library can be used as a reliable benchmark for sequence alignment algorithms and for structural and functional domain finders, owing to its internal consistency and very restricted set of initial assumptions. EncoMPASS can also be used to infer the functionality of membrane proteins and to organize the resources for homology modeling studies. Indeed, the database is able to highlight connections which could be missed in a standard sequence- or structure-based classification, thus making this tool suitable for providing sequence-based studies with a new kind of supplementary information.
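As an illustration of how the similarity estimators can be combined, the following minimal Python sketch counts the sequence, structure and total neighbors of a chain using the thresholds given in Methods (sequence identity > 0.85, TM-score > 0.6). The function and constant names are hypothetical, and the sketch leaves to the caller the choice of which of the two sequence identities (from the sequence or from the structure alignment) is passed in; it is not the code used to build the database.

```python
SEQ_ID_THRESHOLD = 0.85    # sequence identity threshold from Methods
TM_SCORE_THRESHOLD = 0.6   # TM-score threshold from Methods


def neighbor_types(seq_identity, tm_score):
    """Classify one pairwise comparison against the two thresholds."""
    is_sequence = seq_identity > SEQ_ID_THRESHOLD
    is_structure = tm_score > TM_SCORE_THRESHOLD
    return {
        "sequence": is_sequence,
        "structure": is_structure,
        # "total" neighbors satisfy both the structure and the sequence criterion
        "total": is_sequence and is_structure,
    }


def count_neighbors(comparisons):
    """Count neighbors of one chain.

    comparisons: iterable of (seq_identity, tm_score) pairs, one per compared chain.
    """
    counts = {"sequence": 0, "structure": 0, "total": 0}
    for seq_identity, tm_score in comparisons:
        for kind, is_neighbor in neighbor_types(seq_identity, tm_score).items():
            counts[kind] += int(is_neighbor)
    return counts
```

Because both comparisons are strict, a pair lying exactly on a threshold is not counted as a neighbor; a user interested in looser or tighter relationships could apply different cutoffs to the same stored estimators.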