Contrastive Code Representation Learning

Jain, Paras; Jain, Ajay; Zhang, Tianjun; Abbeel, Pieter; Gonzalez, Joseph E.; Stoica, Ion

(Submitted on 9 Jul 2020 (v1), revised 9 Oct 2020 (this version, v2), latest version 6 Jan 2022 (v4))

Abstract: Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines by up to 8% and top-1 accuracy of type inference baselines by up to 13%. Overall, ContraCode achieves 9% higher top-1 and 40% higher top-5 accuracy than the current state-of-the-art static type analyzer for TypeScript.
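
The training task described in the abstract, picking out the variant of an anchor program from a batch of non-equivalent negatives, can be illustrated with a contrastive loss. The sketch below uses an InfoNCE-style objective over cosine similarities. This is an assumption made for illustration: the abstract does not name the exact loss, and the function names, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from a neural encoder applied to program text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (illustrative, not the paper's code).

    anchor:    embedding of the original program
    positive:  embedding of a semantics-preserving variant of that program
    negatives: embeddings of other, non-equivalent programs
    The loss is the cross-entropy of selecting the positive among all candidates.
    """
    # Index 0 holds the positive; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy embeddings: the variant lies close to the anchor, the negatives do not.
anchor = [1.0, 0.0]
variant = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce_loss(anchor, variant, others)
# Swap roles so an unrelated program is treated as the "positive".
loss_misaligned = info_nce_loss(anchor, [0.0, 1.0], [variant, [-1.0, 0.0]])
```

Under these assumptions the loss is small when the anchor's embedding is closer to its variant than to any negative, and large otherwise. Minimizing it therefore pushes an encoder toward features shared by equivalent programs, regardless of their surface form.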

Comments:	Code available at this https URL
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Machine Learning (stat.ML)
Cite as:	arXiv:2007.04973 [cs.LG]
	(or arXiv:2007.04973v2 [cs.LG] for this version)

Submission history

From: Paras Jain
[v1] Thu, 9 Jul 2020 17:59:06 GMT (362kb,D)
[v2] Fri, 9 Oct 2020 05:30:35 GMT (418kb,D)
[v3] Thu, 15 Apr 2021 17:58:44 GMT (3148kb,D)
[v4] Thu, 6 Jan 2022 19:18:09 GMT (3181kb,D)
