Contaminants derived from consumables, reagents, and sample handling often negatively affect LC-MS data acquisition. In proteomics experiments, they can markedly reduce identification performance, reproducibility, and quantitative robustness. Here, we introduce a data analysis workflow combining MS1 feature extraction in Skyline with HowDirty, an R-markdown-based tool that automatically generates an interactive report on the molecular contaminant level in LC-MS data sets. To facilitate the interpretation of the results, the HTML report is self-contained and self-explanatory, including plots that can be easily interpreted. The R package HowDirty is available from https://github.com/DavidGZ1/HowDirty. To showcase an application of HowDirty, we assessed the impact of ultrafiltration units from different providers on sample purity after filter-aided sample preparation (FASP) digestion. This allowed us to select the filter units with the lowest contamination risk. Notably, the filter units with the lowest contaminant levels showed higher reproducibility regarding the number of peptides and proteins identified. Overall, HowDirty enables the efficient evaluation of sample quality covering a wide range of common contaminant groups that typically impair LC-MS analyses, facilitating corrective or preventive actions to minimize instrument downtime.
Relative and absolute intensity-based protein quantification across cell lines, tissue atlases, and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity, and correlation with RNA expression. Most studies provide MS1 feature-based label-free quantitative (LFQ) datasets; however, growing numbers of isobaric tandem mass tags (TMT) datasets remain unexplored. Here, we compare traditional intensity-based absolute quantification (iBAQ) proteome abundance ranking to an analogous method using reporter ion proteome abundance ranking with data from an experiment where LFQ and TMT were measured on the same samples. This new TMT method substitutes reporter ion intensities for MS1 feature intensities in the iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ values from two independent large-scale tissue atlas datasets (one LFQ and one TMT) using robust bottom-up proteomic identification, normalisation, and quantitation workflows.
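The substitution described above can be illustrated with a minimal sketch. It assumes the usual iBAQ definition (summed peptide intensity divided by the number of theoretically observable peptides) and simply swaps the input intensities from MS1 features to reporter ions. The function names, protein identifiers, and numbers are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Sequence


def ibaq(intensities: Sequence[float], n_theoretical_peptides: int) -> float:
    """Summed intensity divided by the number of theoretically observable peptides."""
    if n_theoretical_peptides <= 0:
        raise ValueError("n_theoretical_peptides must be positive")
    return sum(intensities) / n_theoretical_peptides


def abundance_ranking(values: Dict[str, float]) -> List[str]:
    """Protein identifiers ordered from most to least abundant."""
    return sorted(values, key=lambda protein: values[protein], reverse=True)


# Hypothetical toy data: protein -> (intensities, theoretical peptide count).
# LFQ uses MS1 feature intensities; the TMT variant uses reporter ion intensities.
lfq_input = {
    "P1": ([4.0e6, 2.0e6], 3),
    "P2": ([9.0e6], 10),
    "P3": ([1.0e6, 1.0e6, 1.0e6], 1),
}
tmt_input = {
    "P1": ([300.0, 150.0], 3),
    "P2": ([500.0], 10),
    "P3": ([100.0, 120.0, 80.0], 1),
}

lfq_ibaq = {p: ibaq(i, n) for p, (i, n) in lfq_input.items()}
tmt_ibaq = {p: ibaq(i, n) for p, (i, n) in tmt_input.items()}

lfq_rank = abundance_ranking(lfq_ibaq)
tmt_rank = abundance_ranking(tmt_ibaq)
print(lfq_rank, tmt_rank)
```

Because both variants share the same denominator, any difference in the resulting rankings comes from the intensity source alone, which is what makes the two rankings directly comparable.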
In proteomics, fast, efficient and highly reproducible sample preparation is of utmost importance, particularly in view of fast scanning mass spectrometers enabling analyses of large sample series. To address this need, we have developed the web application MassSpecPreppy that operates on the open science OT-2 liquid handling robot from Opentrons. This platform can prepare up to 96 samples at once, performing tasks like BCA protein concentration determination, sample digestion with normalization, reduction/alkylation and peptide elution into vials or loading specified peptide amounts onto Evotips in an automated and flexible manner. The performance of the developed workflows using MassSpecPreppy was compared with standard manual sample preparation workflows. The BCA assay experiments revealed an average recovery of 101.3% (SD: ±7.82%) for the MassSpecPreppy workflow, while the manual workflow had a recovery of 96.3% (SD: ±9.73%). The species mix used in the evaluation experiments showed that 94.5% of protein groups for OT-2 digestion and 95% for manual digestion passed the significance thresholds with comparable peptide-level coefficients of variation. These results demonstrate that MassSpecPreppy is a versatile and scalable platform for automated sample preparation, producing injection-ready samples for proteomics research.
Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community-wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non-physiological complexes. The non-physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein-protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non-physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross-validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94 respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines were shown to recall the physiological dimers with significantly higher accuracy than the non-physiological set, lending support for the pertinence of our benchmark dataset. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.
Trans-activation response DNA binding protein of 43 kDa (TDP-43) regulates a great variety of cellular processes in the nucleus and cytosol. In addition, a defined subset of neurodegenerative diseases is characterized by nuclear depletion of TDP-43 as well as cytosolic mislocalization and aggregation. To perform its diverse functions, TDP-43 can associate with different ribonucleoprotein complexes. Combined with transcriptomics, MS interactome studies have unveiled associations between TDP-43 and the spliceosome machinery, polysomes and RNA granules. Moreover, the highly dynamic, low-valency interactions regulated by its low-complexity domain call for innovative proximity labeling methodologies. In addition to protein partners, the analysis of post-translational modifications showed that they may play a role in the nucleocytoplasmic shuttling, RNA binding, liquid-liquid phase separation and protein aggregation of TDP-43. Here we review the various TDP-43 ribonucleoprotein complexes characterized so far, how they contribute to the diverse functions of TDP-43, and the roles of post-translational modifications. Further understanding of the fluid dynamic properties of TDP-43 in ribonucleoprotein complexes, RNA granules, and self-assemblies will advance the understanding of RNA processing in cells and perhaps help to develop novel therapeutic approaches for TDPopathies.
Proteins play an essential role in the vital biological processes governing cellular functions. Most proteins function as members of macromolecular machines, with the network of interacting proteins revealing the molecular mechanisms driving the formation of these complexes. Profiling the physiology-driven remodeling of these interactions within different contexts constitutes a crucial component to achieving a comprehensive systems-level understanding of interactome dynamics. Here, we apply co-fractionation mass spectrometry and computational modeling to quantify and profile the interactions of ~2,000 proteins in the bacterium Escherichia coli cultured under ten distinct culture conditions. The resulting quantitative co-elution patterns revealed large-scale condition-dependent interaction remodeling among protein complexes involved in diverse biochemical pathways in response to the unique environmental challenges. Network-level analysis highlighted interactome-wide biophysical properties and structural patterns governing interaction remodeling. Our results provide evidence of the local and global plasticity of the E. coli interactome along with a rigorous generalizable framework to define protein interaction specificity. We provide an accompanying interactive web application to facilitate exploration of these rewired networks.
Cell-derived extracellular vesicles (EVs) are evolutionarily conserved secretory organelles that, based on their molecular composition, are important intercellular signaling regulators. At least three classes of circulating EVs are known based on mechanism of biogenesis: exosomes (sEVs/Exos), microparticles (lEVs/MPs) and shed midbody remnants (sMB-Rs). sEVs/Exos are of endosomal pathway origin, lEVs/MPs arise from plasma membrane blebbing, and sMB-Rs arise from symmetric cytokinetic abscission. Here, we isolate sEVs/Exos, lEVs/MPs and sMB-Rs secreted from human isogenic primary (SW480) and metastatic (SW620) colorectal cancer (CRC) cell lines in milligram quantities for label-free MS/MS-based proteomic profiling. Purified EVs revealed selective packaging of exosomal protein markers in SW480/SW620-sEVs/Exos and of metabolic enzymes in SW480/SW620-lEVs/MPs, while centralspindlin complex proteins, nucleoproteins, splicing factors, RNA granule proteins, translation-initiation factors, and mitochondrial proteins selectively traffic to SW480/SW620-sMB-Rs. Collectively, we identify 39 human cancer-associated genes in EVs: 17 associated with SW480-EVs and 22 with SW620-EVs. We highlight oncogenic receptors/transporters selectively enriched in sEVs/Exos (EGFR/FAS in SW480-sEVs/Exos and MET, TGFBR2, ABCB1 in SW620-sEVs/Exos). Interestingly, MDK, STAT1, and TGM2 are selectively enriched in SW480-sMB-Rs, and ADAM15 in SW620-sMB-Rs. Our study reveals that sEVs/Exos, lEVs/MPs and sMB-Rs have distinct protein signatures that open potential diagnostic avenues for the clinical use of distinct EV types.
Over the past two decades, there has been increasing research into the molecular composition and function of small extracellular vesicles in the central nervous system. This is due in part to the recognition that small extracellular vesicles likely contribute to the pathogenesis of neurological diseases such as Alzheimer's disease, but also to an understanding that small extracellular vesicles are a source of potential biomarkers. Small extracellular vesicles carry specific cargo that reflects their biogenesis and cellular origins, including protein, RNA and lipid. While the protein and RNA content of small extracellular vesicles in central nervous system diseases has been studied extensively, our understanding of the lipidome of small extracellular vesicles in the central nervous system is still in its infancy. Lipids play a significant role in maintaining central nervous system structure and function, and the dysregulation of lipid metabolism is known to occur in many neurological disorders, including Alzheimer's disease. Here we review what is currently known about lipid dyshomeostasis in Alzheimer's disease. We propose that small extracellular vesicle lipids may provide insight into the pathophysiology and progression of Alzheimer's disease and other neurological disorders and, perhaps in the future, aid in disease monitoring and detection.
Native mass spectrometry is a rapidly emerging technique for fast and sensitive structural analysis of protein constructs that maintains the higher-order structure of the protein. The coupling with electromigrative separation techniques under native conditions enables the characterization of proteoforms and highly complex protein mixtures. In this review, we present an overview of current native CE-MS technology. First, the status of native separation conditions is described for capillary zone electrophoresis (CZE), affinity capillary electrophoresis (ACE), and capillary isoelectric focusing (CIEF), as well as their chip-based formats, including essential parameters such as electrolyte composition and capillary coatings. Further, conditions required for native ESI-MS of (large) protein constructs, including instrumental parameters of QTOF and Orbitrap systems, as well as requirements for native CE-MS interfacing are presented. On this basis, methods and applications of the different modes of native CE-MS are summarized and discussed in the context of biological, medical, and biopharmaceutical questions. Finally, key achievements are highlighted and remaining challenges are pointed out.
For decades, molecular biologists have been uncovering the mechanics of biological systems. Efforts to bring their findings together have led to the development of multiple databases and information systems that capture and present pathway information in a computable network format. Concurrently, the advent of modern omics technologies has empowered researchers to systematically profile cellular processes across different modalities. Numerous algorithms, methodologies, and tools have been developed to use prior knowledge networks in the analysis of omics datasets. Interestingly, it has been repeatedly demonstrated that the source of prior knowledge can greatly impact the results of a given analysis. For these methods to be successful it is paramount that their selection of prior knowledge networks is amenable to the data type and the computational task they aim to accomplish. Here we present a five-level framework that broadly describes network models in terms of their scope, level of detail, and ability to inform causal predictions. To contextualize this framework, we review a handful of network-based omics analysis methods at each level, while also describing the computational tasks they aim to accomplish.
Multiomics approaches to studying systems biology are very powerful tools that can elucidate changes at the genomic, transcriptomic, proteomic, and metabolomic levels within a particular cell type in response to an infection. These approaches are valuable for understanding the mechanisms behind disease pathogenesis, and specifically how the immune system responds to being challenged. With the emergence of the COVID-19 pandemic, now more than ever, the importance and utility of these tools have become evident in garnering a better understanding of the systems biology within the innate and adaptive immune response and for developing treatments and preventative measures for new and emerging pathogens that pose a threat to human health. In this review we focus on the various state-of-the-art “omics” technologies used within the scope of innate immunity.
Top-down proteomics (TDP) directly analyzes intact proteins and thus provides more comprehensive qualitative and quantitative proteoform-level information than conventional bottom-up proteomics that relies on digested peptides and protein inference. While significant advancements have been made in TDP in sample preparation, separation, instrumentation, and data analysis, reliable and reproducible data analysis still remains one of the major bottlenecks in TDP. A key step for robust data analysis is the establishment of an objective estimation of proteoform-level false discovery rate (FDR) in proteoform identification. The most widely used FDR estimation scheme is based on the target-decoy approach (TDA), which has primarily been established for bottom-up proteomics. We present evidence that the TDA-based FDR estimation may not work at the proteoform-level due to an overlooked factor, namely the erroneous deconvolution of precursor masses, which leads to incorrect FDR estimation. We argue that the conventional TDA-based FDR in proteoform identification is in fact protein-level FDR rather than proteoform-level FDR unless precursor deconvolution error rate is taken into account. To address this issue, we propose a formula to correct for proteoform-level FDR bias by combining TDA-based FDR and precursor deconvolution error rate.
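For orientation, the conventional target-decoy estimate that the abstract refers to can be sketched as follows. This is only the textbook form (decoy hits divided by target hits above a score threshold), written with hypothetical names and toy scores. It does not reproduce the correction formula proposed by the authors, which is not stated in the abstract.

```python
from typing import Iterable, Tuple


def tda_fdr(matches: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR as (#decoy hits) / (#target hits) among matches scoring >= threshold.

    Each match is a (score, is_decoy) pair; higher scores are assumed to be better.
    """
    n_target = 0
    n_decoy = 0
    for score, is_decoy in matches:
        if score < threshold:
            continue
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
    if n_target == 0:
        return 0.0  # no accepted targets, so nothing to report
    return n_decoy / n_target


# Hypothetical toy identifications: (score, is_decoy)
toy_matches = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (3.0, True)]
estimated_fdr = tda_fdr(toy_matches, threshold=7.0)
print(f"Estimated FDR at threshold 7.0: {estimated_fdr:.3f}")
```

Note that this estimate only counts wrong sequence matches that behave like decoys. An identification with the correct protein but a wrongly deconvoluted precursor mass is not captured by it, which is the gap the abstract argues must be accounted for separately.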
HNF4α is a master regulator gene belonging to the nuclear receptor superfamily involved in regulating a wide range of critical biological processes in different organs. Structurally, the HNF4A locus is organized with two independent promoters and is subjected to alternative splicing with the production of twelve distinct isoforms. Little is known about the mechanisms each isoform uses to regulate transcription and their biological impact, although some reports have addressed these aspects. Proteomic analyses have led to the identification of proteins that interact with specific HNF4α isoforms. The identification and validation of these interactions and their role in co-regulating targeted gene expression are essential to better understand the role of this transcription factor in different biological processes and pathologies. This review addresses the historical origin of HNF4α isoforms and some of the main functions of the P1 and P2 isoform subgroups, and provides information on the most recent research on the nature and function of proteins associated with each of the isoforms in some biological contexts.
Although top-down (TD) proteomics techniques, aimed at the analysis of intact proteins and proteoforms, are becoming increasingly popular, efforts are needed at different levels to generalise their adoption. In this context, there are numerous improvements that are possible in the area of open science including the FAIR (Findability, Accessibility, Interoperability and Reusability) data principles. These include e.g. increased data sharing practices and availability of tailored open data standards. Additionally, the field would benefit from the development of open analysis workflows that can enable e.g. data reuse of public datasets, something that is increasingly common in other proteomics fields. We present an open and modular platform for the analysis and visualisation of TD proteomics data called TopDownApp. It can be used as a flexible analysis platform, through the use of a common workflow engine, common data formats for input/output, and software containerisation. It can also serve as a tool for visual inspection through its simple setup. As a key point, it can also be used as a development platform for new tools through the use of Python, a modular design, software containerisation and common data formats. TopDownApp is open source and freely available at: https://github.com/mwalzer/TopDownApp.
Advances in proteogenomic technologies have revealed hundreds to thousands of translated small open reading frames (sORFs) that encode microproteins in genomes across evolutionary space. While many microproteins have now been shown to play critical roles in biology and human disease, a majority of recently identified microproteins have little or no experimental evidence regarding their functionality. Computational tools have some limitations for analysis of short, poorly conserved microprotein sequences, so additional tools are needed to determine the role of each member of this recently discovered polypeptide class. A currently underexplored avenue in the study of microproteins is structure prediction and determination, which delivers a depth of functional information. In this review, we provide a brief overview of microprotein discovery methods, then examine examples of microprotein structures (and, conversely, intrinsic disorder) that have been experimentally determined using crystallography, cryo-electron microscopy, and NMR, which provide insight into their molecular functions and mechanisms. Additionally, we discuss examples of predicted microprotein structures that have provided insight or context regarding their function. Analysis of microprotein structure at the angstrom level, and confirmation of predicted structures, therefore, has potential to identify translated microproteins that are of biological importance and to provide molecular mechanism for their in vivo roles.
Cancer-associated cachexia is a wasting syndrome that results in dramatic loss of whole-body weight, predominantly due to loss of skeletal muscle mass. It has been established that cachexia-inducing cancer cells secrete proteins and extracellular vesicles (EVs) that can induce muscle atrophy. Though several studies have examined these cancer cell-derived factors, targeting some of these components has shown little or no clinical benefit. To develop new therapies, an understanding of the dysregulated proteins and signalling pathways that regulate catabolic gene expression during muscle wasting is essential. Here, we sought to examine the effect of conditioned media (CM) that contain secreted factors and EVs from cachexia-inducing C26 colon cancer cells on C2C12 myotubes using mass spectrometry-based label-free quantitative proteomics. We identified significant changes in the protein profile of C2C12 cells upon exposure to C26-derived CM. Functional enrichment analysis revealed enrichment of proteins associated with inflammation, mitochondrial dysfunction, muscle catabolism, ROS production, and ER stress in CM-treated myotubes. Furthermore, strong downregulation of pathways involved in muscle structural integrity, development and/or regeneration was observed. Together, these enriched proteins in atrophied muscle could be utilized as potential muscle wasting markers and the dysregulated biological processes could be employed for therapeutic benefit in cancer-induced muscle wasting.
Due to their oftentimes ambiguous nature, phosphopeptide positional isomers can present challenges in bottom-up mass spectrometry-based workflows as search engine scores alone are often not enough to confidently distinguish them. Additional scoring algorithms can remedy this by providing confidence metrics in addition to these search results, reducing ambiguity. Here we describe challenges to interpreting phosphoproteomics data and review several different approaches to determine sites of phosphorylation for both data-dependent and data-independent acquisition-based workflows. Finally, we discuss open questions regarding neutral losses, gas-phase rearrangement, and false localization rate estimation experienced by both types of acquisition workflows and best practices for managing ambiguity in phosphosite determination.
Most proteins function by forming complexes within a dynamic interconnected network that underlies various biological mechanisms. To systematically investigate such interactomes, high-throughput techniques including co-fractionation mass spectrometry (CF-MS) have been developed to capture, identify, and quantify protein-protein interactions (PPIs) at large scale. Compared to other techniques, CF-MS allows the global identification and quantification of native protein complexes in one setting, without genetic manipulation and overexpression. Furthermore, quantitative CF-MS can potentially elucidate the distribution of a protein in multiple co-elution features, informing the stoichiometries and dynamics of a target protein complex. In this issue, Youssef et al. (Proteomics 2023, XX, XXXX-XXXX) combined multiplex CF-MS and an in-house algorithm to study the dynamics of the PPI network for Escherichia coli grown under ten different conditions. While the results demonstrated that most proteins remained stable, the authors were able to detect disrupted interactions that were growth condition-specific. Further bioinformatics analyses also revealed biophysical properties and structural patterns that govern such a response.