Title: Contrastive Code Representation Learning

Abstract: Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines by up to 8% and top-1 accuracy of type inference baselines by up to 13%. Overall, ContraCode achieves 9% higher top-1 and 40% higher top-5 accuracy than the current state-of-the-art static type analyzer for TypeScript.
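
The training task described in the abstract, picking out the variant of an anchor program from a batch of non-equivalent negatives, is commonly framed as an InfoNCE-style contrastive loss. The following is a minimal sketch under that assumption, not the authors' implementation: the function name info_nce_loss, the helper functions, the temperature value 0.07, and the use of plain Python lists as embeddings are all illustrative choices that do not come from the abstract.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper, for illustration).

    anchors[i] and positives[i] are embeddings of two variants of the same
    program. For anchor i, every positives[j] with j != i acts as a negative.
    The temperature is an assumed hyperparameter.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate in the batch.
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching variant as the correct class.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: two "programs", each with two variants.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when each anchor is closest to its own variant and large when it is closer to a different program, which is what pushes an encoder toward features that reflect functionality and away from surface form.
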
Comments: Code available at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Machine Learning (stat.ML)
Cite as: arXiv:2007.04973 [cs.LG]
  (or arXiv:2007.04973v2 [cs.LG] for this version)

Submission history

From: Paras Jain
[v1] Thu, 9 Jul 2020 17:59:06 GMT (362kb,D)
[v2] Fri, 9 Oct 2020 05:30:35 GMT (418kb,D)
[v3] Thu, 15 Apr 2021 17:58:44 GMT (3148kb,D)
[v4] Thu, 6 Jan 2022 19:18:09 GMT (3181kb,D)
