Computer Science > Machine Learning
Title: Contrastive Code Representation Learning
(Submitted on 9 Jul 2020 (v1), revised 9 Oct 2020 (this version, v2), latest version 6 Jan 2022 (v4))
Abstract: Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines by up to 8% and top-1 accuracy of type inference baselines by up to 13%. Overall, ContraCode achieves 9% higher top-1 and 40% higher top-5 accuracy than the current state-of-the-art static type analyzer for TypeScript.
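The contrastive objective described in the abstract (identifying the variant of an anchor program among a batch of non-equivalent negatives) can be illustrated with a minimal sketch. The abstract does not specify the loss, so this assumes an InfoNCE-style loss with cosine similarity and a temperature of 0.07. The function names and toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's exact one).

    anchors[i] and positives[i] are embeddings of two variants of program i.
    positives[j] for j != i act as in-batch negatives for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true variant.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each variant is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each variant is close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

In this sketch the loss is small when each anchor is most similar to its own variant and large when it is more similar to a different program, which is the pressure that would push an encoder toward features of functionality rather than surface form.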
Submission history
From: Paras Jain
[v1] Thu, 9 Jul 2020 17:59:06 GMT (362kb,D)
[v2] Fri, 9 Oct 2020 05:30:35 GMT (418kb,D)
[v3] Thu, 15 Apr 2021 17:58:44 GMT (3148kb,D)
[v4] Thu, 6 Jan 2022 19:18:09 GMT (3181kb,D)