Title: Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information

Abstract: Motivation: To bridge the exponentially growing gap between the number of unlabeled and labeled proteins, a few works have adopted semi-supervised learning for protein sequence modeling. They pre-train a model on a substantial amount of unlabeled data and transfer the learned representations to various downstream tasks. Nonetheless, current pre-training methods mostly rely on a language modeling task and often show limited performance. A complementary protein-specific pre-training task is therefore necessary to better capture the information contained within unlabeled protein sequences.
Results: In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same family prediction. PLUS can be used to pre-train various model architectures. In this work, we mainly use PLUS to pre-train a recurrent neural network (RNN) and refer to the resulting model as PLUS-RNN. It improves upon state-of-the-art pre-training methods on six out of seven tasks. The seven tasks comprise (1) three protein(-pair)-level classification, (2) two protein-level regression, and (3) two amino-acid-level classification tasks. Furthermore, we present results from our ablation studies and interpretation analyses to better understand the strengths of PLUS-RNN.
Availability: The codes and pre-trained models are available at this https URL
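
The two pre-training tasks named in the abstract, masked language modeling and same family prediction, can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation: the mask symbol `#`, the default masking probability of 0.15, the function names `mask_sequence` and `same_family_pairs`, and the toy sequences and family identifiers are all assumptions made for illustration.

```python
import random

# Hypothetical mask symbol; the paper's actual token vocabulary is not given here.
MASK = "#"


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Build one masked-language-modeling example from a protein sequence.

    Returns (masked_seq, targets), where targets maps each masked position
    to the original residue a model would be asked to recover.
    mask_prob=0.15 is an assumed default, not a value taken from the paper.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, residue in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = residue
        else:
            masked.append(residue)
    return "".join(masked), targets


def same_family_pairs(proteins):
    """Build (seq_a, seq_b, label) triples for same family prediction.

    proteins is a list of (sequence, family_id) tuples; label is 1 when
    both sequences carry the same family identifier and 0 otherwise.
    """
    pairs = []
    for i in range(len(proteins)):
        for j in range(i + 1, len(proteins)):
            (seq_a, fam_a), (seq_b, fam_b) = proteins[i], proteins[j]
            pairs.append((seq_a, seq_b, int(fam_a == fam_b)))
    return pairs


if __name__ == "__main__":
    # Toy data: sequences and family names are invented placeholders.
    toy = [("MKTAYIAKQR", "famA"), ("MKSAYLAKQR", "famA"), ("GAVLIPFWMC", "famB")]
    print(mask_sequence(toy[0][0], mask_prob=0.3, rng=random.Random(0)))
    print(same_family_pairs(toy))
```

A model pre-trained in this style would receive the masked sequence and be scored on the residues in `targets`, and would receive each pair and be scored on its label. How the two objectives are weighted and combined is described in the paper itself and is not reproduced here.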
Comments: 9 pages
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
Cite as: arXiv:1912.05625 [q-bio.BM]
  (or arXiv:1912.05625v3 [q-bio.BM] for this version)

Submission history

From: Seonwoo Min
[v1] Mon, 25 Nov 2019 10:12:10 GMT (330kb,D)
[v2] Mon, 3 Feb 2020 09:06:30 GMT (799kb,D)
[v3] Sat, 25 Apr 2020 03:58:33 GMT (797kb,D)