Han Guo et al.

The study of proteins is critical across scientific disciplines, yet understanding their complex structure-function relationships remains challenging. Recent advances in large language models (LLMs) have demonstrated their ability to comprehend task-specific knowledge, suggesting that specially trained ChatGPT-like systems could accelerate protein research. In this work, we introduce ProteinChat, a prototype system aimed at learning and understanding protein 3D structures. ProteinChat enables users to upload proteins, ask questions, and engage in interactive conversations to gain insights. The ProteinChat system consists of three main components: a composite encoder block, a projection layer, and an LLM. A protein is first encoded into a protein embedding, which is then projected into the LLM's input space; the LLM combines the user's question with this embedding to generate informative answers. To train ProteinChat, we curated the RCSB-PDB Protein Description Dataset, comprising 143,508 protein-description pairs from publicly available sources. By leveraging ProteinChat, researchers can potentially expedite their investigations into protein structure and function, benefiting areas such as drug development and therapeutics, food science and nutrition, and various other aspects of our lives. This initial step lays the foundation for further exploration and use of ChatGPT-like systems in protein research. The code is available at \url{https://github.com/UCSD-AI4H/proteinchat} and the dataset can be downloaded from \url{https://drive.google.com/file/d/1AeJW5BY5C-d8mKJjAULTax6WA4hzWS0N/view?usp=share_link}.
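To make the three-component design concrete, the following is a minimal PyTorch sketch of how a protein embedding might be projected and prepended to a tokenized question before being handed to the LLM. The module names, dimensions, and the simple linear projection are illustrative assumptions for exposition, not the released ProteinChat implementation.

```python
# Minimal sketch of the three-component design described above: a protein
# structure encoder, a trainable projection layer, and an LLM that consumes
# the projected protein embedding together with the user's question.
# All module names and dimensions below are illustrative placeholders.
import torch
import torch.nn as nn


class ProteinChatSketch(nn.Module):
    def __init__(self, protein_dim=1280, llm_hidden=4096, num_protein_tokens=32):
        super().__init__()
        # Stand-in for the composite protein structure encoder (typically frozen).
        self.protein_encoder = nn.Sequential(
            nn.Linear(protein_dim, protein_dim),
            nn.GELU(),
        )
        # Projection layer that maps protein embeddings into the LLM's input space.
        self.projection = nn.Linear(protein_dim, llm_hidden)
        self.num_protein_tokens = num_protein_tokens

    def forward(self, protein_features, question_embeddings):
        # protein_features: (batch, num_protein_tokens, protein_dim)
        # question_embeddings: (batch, seq_len, llm_hidden), from the LLM's embedding table
        protein_emb = self.protein_encoder(protein_features)
        projected = self.projection(protein_emb)  # (batch, num_protein_tokens, llm_hidden)
        # Prepend the projected protein tokens to the question embeddings; the LLM
        # would then generate an answer conditioned on this combined sequence.
        return torch.cat([projected, question_embeddings], dim=1)


if __name__ == "__main__":
    model = ProteinChatSketch()
    protein = torch.randn(1, 32, 1280)
    question = torch.randn(1, 16, 4096)
    print(model(protein, question).shape)  # torch.Size([1, 48, 4096])
```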

Youwei Liang et al.

Xingyi Yang et al.

Pretraining has become a standard technique in computer vision and natural language processing, and it usually improves performance substantially. Previously, the dominant pretraining method was transfer learning (TL), which uses labeled data to learn a good representation network. Recently, a new pretraining approach, self-supervised learning (SSL), has demonstrated promising results on a wide range of applications. SSL does not require annotated labels; it is conducted purely on input data by solving auxiliary tasks defined on the input examples. Reported results show that SSL outperforms TL in certain applications, while TL outperforms SSL in others. There is not yet a clear understanding of which properties of data and tasks make one approach outperform the other. Without an informed guideline, ML researchers have to try both methods to find out empirically which one is better, which is usually time-consuming. In this work, we aim to address this problem. We perform a comprehensive comparative study of SSL and TL regarding which one works better under different properties of data and tasks, including the domain difference between source and target tasks, the amount of pretraining data, class imbalance in the source data, and the use of target data for additional pretraining. The insights distilled from our comparative studies can help ML researchers decide which method to use based on the properties of their applications.
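As an illustration of the difference between the two pretraining paradigms, the sketch below contrasts one supervised TL pretraining step (using ground-truth source labels) with one SSL step that uses a rotation-prediction pretext task, where the supervision signal is derived from the inputs alone. The tiny backbone, the choice of pretext task, and all hyperparameters are placeholder assumptions for illustration; the study itself compares full-scale pipelines across many settings.

```python
# Illustrative contrast between the two pretraining recipes: transfer learning
# (supervised pretraining on labeled source data) versus self-supervised
# learning (an auxiliary task built from the inputs alone, here rotation
# prediction). Architectures and data are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_backbone(feature_dim=128):
    # Tiny stand-in for a representation network (e.g., a ResNet in practice).
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feature_dim), nn.ReLU())


def tl_pretrain_step(backbone, head, images, labels, optimizer):
    """Transfer learning: pretrain with ground-truth labels on the source task."""
    logits = head(backbone(images))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def ssl_pretrain_step(backbone, head, images, optimizer):
    """Self-supervised learning: the labels are derived from the inputs themselves
    (rotation by 0/90/180/270 degrees), so no human annotation is needed."""
    rotations = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, rotations)])
    logits = head(backbone(rotated))
    loss = F.cross_entropy(logits, rotations)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))

    backbone = make_backbone()
    tl_head = nn.Linear(128, 10)   # predicts source-task classes
    ssl_head = nn.Linear(128, 4)   # predicts one of four rotations
    opt = torch.optim.SGD(list(backbone.parameters())
                          + list(tl_head.parameters())
                          + list(ssl_head.parameters()), lr=0.1)

    print("TL step loss:", tl_pretrain_step(backbone, tl_head, images, labels, opt))
    print("SSL step loss:", ssl_pretrain_step(backbone, ssl_head, images, opt))
```

After either form of pretraining, the backbone would be fine-tuned on the target task; the study's comparisons concern which pretraining signal yields the better starting representation under different data and task properties.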