Protein Embedding based Alignment
  • Benjamin Giovanni Iovino,
  • Yuzhen Ye
Benjamin Giovanni Iovino
Indiana University Bloomington Luddy School of Informatics Computing and Engineering
Yuzhen Ye
Indiana University Bloomington Luddy School of Informatics Computing and Engineering

Corresponding Author:[email protected]

Abstract

Despite considerable progress in alignment algorithms, aligning divergent protein sequences, including those sharing less than 20-35% pairwise identity (the so-called “twilight zone”), remains a difficult problem. Many alignment algorithms have relied on substitution matrices to generate alignments since these matrices were introduced in the 1970s. These matrices, however, do not work well within the twilight zone. We developed PEbA (Protein Embedding based Alignment). Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score between two amino acids is based on their embeddings from a protein language model. We tested PEbA on benchmark alignments, and the results showed that it greatly outperformed BLOSUM substitution matrix-based pairwise alignment. The size of the improvement in alignment quality depended on the similarity of the sequence pair, with PEbA performing over five times as well for pairs of sequences with <10% identity. We compared embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA even outperformed DEDAL, a recently developed deep learning model created specifically for aligning protein sequences, particularly on longer alignments and on sequences with low pairwise identity. Our results suggest that general-purpose protein language models provide useful contextual information for accurate protein alignment.
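To make the idea concrete, the following is a minimal sketch of Smith-Waterman-style local alignment in which the match score comes from per-residue embeddings instead of a substitution matrix. It is an illustration under stated assumptions, not the PEbA implementation: the abstract does not specify the scoring function or gap model, so the scaled cosine similarity, the `scale` value, and the linear `gap` penalty here are all assumptions, as are the function names. Random vectors stand in for real ProtT5 or ESM-2 embeddings.

```python
import numpy as np

def cosine_score_matrix(emb_a, emb_b, scale=10.0):
    """Score every residue pair by scaled cosine similarity (assumed scoring)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return scale * (a @ b.T)  # shape: (len_a, len_b)

def local_align(scores, gap=-5.0):
    """Smith-Waterman local alignment with a linear gap penalty (assumed)."""
    n, m = scores.shape
    H = np.zeros((n + 1, m + 1))
    best, best_pos = 0.0, (0, 0)
    # Fill the dynamic programming table; cell values are floored at zero.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,
                H[i - 1, j - 1] + scores[i - 1, j - 1],  # align residues i, j
                H[i - 1, j] + gap,                       # gap in sequence b
                H[i, j - 1] + gap,                       # gap in sequence a
            )
            if H[i, j] > best:
                best, best_pos = H[i, j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i, j] > 0:
        if np.isclose(H[i, j], H[i - 1, j - 1] + scores[i - 1, j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(H[i, j], H[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return float(best), pairs[::-1]

# Toy data: random vectors stand in for language-model embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 16))   # 8 residues, 16-dimensional embeddings
emb_b = emb_a[2:6].copy()          # sequence b is a fragment of sequence a
score, pairs = local_align(cosine_score_matrix(emb_a, emb_b))
```

In this toy case sequence b copies residues 2 to 5 of sequence a, so the returned `pairs` should map those positions onto positions 0 to 3 of b. With real embeddings, the scale and gap penalty would need tuning against benchmark alignments.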