Quantitative Biology > Biomolecules
Title: Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
(Submitted on 25 Nov 2019 (v1), revised 25 Apr 2020 (this version, v3), latest version 16 Sep 2021 (v4))
Abstract: Motivation: To bridge the exponentially growing gap between the numbers of unlabeled and labeled proteins, a couple of works have adopted semi-supervised learning for protein sequence modeling. They pre-train a model on a substantial amount of unlabeled data and transfer the learned representations to various downstream tasks. Nonetheless, current pre-training methods mostly rely on a language modeling task and often show limited performance. Therefore, a complementary protein-specific pre-training task is necessary to better capture the information contained within unlabeled protein sequences.
Results: In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same family prediction. PLUS can be used to pre-train various model architectures. In this work, we mainly use PLUS to pre-train a recurrent neural network (RNN) and refer to the resulting model as PLUS-RNN. PLUS-RNN outperforms state-of-the-art pre-training methods on six out of seven tasks, i.e., (1) three protein(-pair)-level classification, (2) two protein-level regression, and (3) two amino-acid-level classification tasks. Furthermore, we present results from our ablation studies and interpretation analyses to better understand the strengths of PLUS-RNN.
Availability: The code and pre-trained models are available at this https URL
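As an illustration of the two pre-training tasks named in the abstract, the following is a minimal sketch of how training examples for masked language modeling and same family prediction might be constructed. It is not the authors' implementation: the function names, the mask symbol, the 15% masking rate, and the 50/50 positive/negative sampling are all assumptions made for this example, and the model and loss computation are omitted.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Replace a random subset of residues with MASK.

    Returns the masked sequence and a list of (position, original residue)
    pairs that a model would be trained to recover. The 15% rate is an
    assumed default, and masking is simplified to plain replacement.
    """
    if rng is None:
        rng = random.Random(0)
    masked, targets = [], []
    for i, residue in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append((i, residue))
        else:
            masked.append(residue)
    return "".join(masked), targets


def make_family_pair(families, rng=None):
    """Sample a pair of sequences with a same-family label.

    `families` maps a family identifier to a list of sequences and must
    contain at least two families. Returns (seq_a, seq_b, label), where
    label is 1 if both sequences come from the same family and 0 otherwise.
    A positive pair may contain the same sequence twice in this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    family_ids = sorted(families)
    fam_a = rng.choice(family_ids)
    seq_a = rng.choice(families[fam_a])
    if rng.random() < 0.5:
        # Positive pair: second sequence drawn from the same family.
        seq_b = rng.choice(families[fam_a])
        label = 1
    else:
        # Negative pair: second sequence drawn from a different family.
        fam_b = rng.choice([f for f in family_ids if f != fam_a])
        seq_b = rng.choice(families[fam_b])
        label = 0
    return seq_a, seq_b, label


if __name__ == "__main__":
    toy_families = {
        "fam1": ["ACDEFGHIKL", "ACDEFGHIKM"],
        "fam2": ["MNPQRSTVWY", "MNPQRSTVWA"],
    }
    rng = random.Random(42)
    seq_a, seq_b, label = make_family_pair(toy_families, rng)
    masked_a, targets_a = mask_sequence(seq_a, rng=rng)
    print(masked_a, targets_a, seq_b, label)
```

In this sketch, each training example would carry two supervision signals: the masked positions to recover and the binary same-family label for the pair.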
Submission history
From: Seonwoo Min
[v1] Mon, 25 Nov 2019 10:12:10 GMT (330kb,D)
[v2] Mon, 3 Feb 2020 09:06:30 GMT (799kb,D)
[v3] Sat, 25 Apr 2020 03:58:33 GMT (797kb,D)
[v4] Thu, 16 Sep 2021 23:13:47 GMT (2797kb,D)