Han Guo et al.

The study of proteins is critical across scientific disciplines, yet understanding their complex structure-function relationships remains challenging. Recent advances in large language models (LLMs) have demonstrated their ability to comprehend task-specific knowledge, suggesting that specially trained ChatGPT-like systems could accelerate protein research. In this work, we introduce ProteinChat, a prototype system aimed at learning and understanding protein 3D structures. ProteinChat enables users to upload proteins, ask questions, and engage in interactive conversations to gain insights. The ProteinChat system consists of three main components: a composite encoder block, a projection layer, and an LLM. A protein is first encoded into a protein embedding, which is then projected into the LLM's input space; the LLM combines the user's question with this embedding to generate informative answers. To train ProteinChat, we curated the RCSB-PDB Protein Description Dataset, comprising 143,508 protein-description pairs from publicly available sources. By leveraging ProteinChat, researchers can potentially expedite their investigations into protein structure and function, benefiting areas such as drug development and therapeutics, food science and nutrition, and various other aspects of our lives. This initial step lays the foundation for further exploration and use of ChatGPT-like systems in protein research. The code is available at \url{https://github.com/UCSD-AI4H/proteinchat} and the dataset can be downloaded from \url{https://drive.google.com/file/d/1AeJW5BY5C-d8mKJjAULTax6WA4hzWS0N/view?usp=share_link}.
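To make the three-component design concrete, the following is a minimal PyTorch sketch of how a protein embedding might be projected and prepended to a tokenized question before being handed to the LLM. The module names, dimensions, and the simple linear projection are illustrative assumptions for exposition, not the released ProteinChat implementation.

```python
# Minimal sketch of the three-component design described above: a protein
# structure encoder, a trainable projection layer, and an LLM that consumes
# the projected protein embedding together with the user's question.
# All module names and dimensions below are illustrative placeholders.
import torch
import torch.nn as nn


class ProteinChatSketch(nn.Module):
    def __init__(self, protein_dim=1280, llm_hidden=4096, num_protein_tokens=32):
        super().__init__()
        # Stand-in for the composite protein structure encoder (typically frozen).
        self.protein_encoder = nn.Sequential(
            nn.Linear(protein_dim, protein_dim),
            nn.GELU(),
        )
        # Projection layer that maps protein embeddings into the LLM's input space.
        self.projection = nn.Linear(protein_dim, llm_hidden)
        self.num_protein_tokens = num_protein_tokens

    def forward(self, protein_features, question_embeddings):
        # protein_features: (batch, num_protein_tokens, protein_dim)
        # question_embeddings: (batch, seq_len, llm_hidden), from the LLM's embedding table
        protein_emb = self.protein_encoder(protein_features)
        projected = self.projection(protein_emb)  # (batch, num_protein_tokens, llm_hidden)
        # Prepend the projected protein tokens to the question embeddings; the LLM
        # would then generate an answer conditioned on this combined sequence.
        return torch.cat([projected, question_embeddings], dim=1)


if __name__ == "__main__":
    model = ProteinChatSketch()
    protein = torch.randn(1, 32, 1280)
    question = torch.randn(1, 16, 4096)
    print(model(protein, question).shape)  # torch.Size([1, 48, 4096])
```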

Youwei Liang et al.

Xingyi Yang et al.

Pretraining has become a standard technique in computer vision and natural language processing, and it usually improves performance substantially. Previously, the dominant pretraining method was transfer learning (TL), which uses labeled data to learn a good representation network. Recently, a new pretraining approach, self-supervised learning (SSL), has demonstrated promising results on a wide range of applications. SSL does not require annotated labels; it is conducted purely on input data by solving auxiliary tasks defined on the input examples. Reported results show that SSL outperforms TL in certain applications, while TL outperforms SSL in others. There is not yet a clear understanding of which properties of data and tasks make one approach outperform the other. Without an informed guideline, ML researchers have to try both methods to find out empirically which one is better, which is usually time-consuming. In this work, we aim to address this problem. We perform a comprehensive comparative study of SSL and TL regarding which one works better under different properties of data and tasks, including the domain difference between source and target tasks, the amount of pretraining data, class imbalance in the source data, and the use of target data for additional pretraining. The insights distilled from our comparative studies can help ML researchers decide which method to use based on the properties of their applications.
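As an illustration of the difference between the two pretraining paradigms, the sketch below contrasts one supervised TL pretraining step (using ground-truth source labels) with one SSL step that uses a rotation-prediction pretext task, where the supervision signal is derived from the inputs alone. The tiny backbone, the choice of pretext task, and all hyperparameters are placeholder assumptions for illustration; the study itself compares full-scale pipelines across many settings.

```python
# Illustrative contrast between the two pretraining recipes: transfer learning
# (supervised pretraining on labeled source data) versus self-supervised
# learning (an auxiliary task built from the inputs alone, here rotation
# prediction). Architectures and data are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_backbone(feature_dim=128):
    # Tiny stand-in for a representation network (e.g., a ResNet in practice).
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feature_dim), nn.ReLU())


def tl_pretrain_step(backbone, head, images, labels, optimizer):
    """Transfer learning: pretrain with ground-truth labels on the source task."""
    logits = head(backbone(images))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def ssl_pretrain_step(backbone, head, images, optimizer):
    """Self-supervised learning: the labels are derived from the inputs themselves
    (rotation by 0/90/180/270 degrees), so no human annotation is needed."""
    rotations = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, rotations)])
    logits = head(backbone(rotated))
    loss = F.cross_entropy(logits, rotations)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))

    backbone = make_backbone()
    tl_head = nn.Linear(128, 10)   # predicts source-task classes
    ssl_head = nn.Linear(128, 4)   # predicts one of four rotations
    opt = torch.optim.SGD(list(backbone.parameters())
                          + list(tl_head.parameters())
                          + list(ssl_head.parameters()), lr=0.1)

    print("TL step loss:", tl_pretrain_step(backbone, tl_head, images, labels, opt))
    print("SSL step loss:", ssl_pretrain_step(backbone, ssl_head, images, opt))
```

After either form of pretraining, the backbone would be fine-tuned on the target task; the study's comparisons concern which pretraining signal yields the better starting representation under different data and task properties.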