Email: {xiaodong.gu, cuinan, bjshen}@sjtu.edu.cn
Zero-Shot Code Representation Learning via Prompt Tuning
Abstract
Learning code representations is a core prerequisite of many software engineering tasks such as code clone detection and code generation. State-of-the-art pre-trained language models (PLMs) such as CodeBERT require a large amount of downstream data for fine-tuning. However, gathering training samples can be prohibitively expensive and impractical for domain-specific languages or project-specific tasks. Besides, pre-training and downstream tasks are usually heterogeneous, which makes it hard to fully exploit the knowledge learned during pre-training. In this paper, we propose Zecoler, a zero-shot approach for learning code representations. Zecoler is built upon a pre-trained programming language model. In order to elicit knowledge from the PLM efficiently, Zecoler casts the downstream tasks into the same form as the pre-training objectives by inserting trainable prompts into the original input. These prompts guide the PLM to generate better results. Subsequently, we employ the prompt tuning technique to search for the optimal prompts automatically. This enables the representation model to efficiently fit the downstream tasks by fine-tuning on a dataset in the source language domain and then reusing the pre-trained knowledge for the target domain in a zero-shot style. We evaluate Zecoler on five code intelligence tasks: code clone detection, code search, method name prediction, code summarization, and code generation. We experiment with programming languages that have no labeled samples, e.g., Solidity and Go, using a model trained on corpora of common languages such as Java. The results show that our approach significantly outperforms baseline models under the zero-shot setting. For example, the accuracy of code search is improved by 30% compared with fine-tuning. In addition, qualitative analysis demonstrates its superior generalizability under both cross-lingual and monolingual few-shot settings.
Keywords: Learning Code Representations · Zero-Shot Learning · Prompt Tuning · Program Understanding and Generation

1 Introduction
Deep learning models have been widely applied to a variety of software engineering tasks, such as clone detection (FangLS0S20), code summarization (ChoiBNL21), and code search (HaldarWXH20). In order to apply deep learning to these tasks, source code needs to be represented as vectors that reflect their deep semantics. For example, in the clone detection task, code representations can be used to identify similar features between two code snippets (ZhangHZWLS21).
Pre-trained language models (PLMs) of code such as CodeBERT (codebert), CodeT5 (codet5) and PL-BART (plbart) have been the cutting-edge code representation technology. A code PLM is pre-trained to learn code representations on large-scale code corpora with self-supervised objectives, and then is fine-tuned to adapt to downstream tasks. They have demonstrated a better understanding of the semantics of source code than previous deep learning models such as code2vec (code2vec) and ASTNN (zhang2019astnn).
Challenges. Despite showing promising results, fine-tuning a PLM on a specific task is challenging. First, its performance relies on the availability of sufficient training data for downstream tasks. In practice, however, labeled data for downstream tasks is often scarce, especially for domain-specific languages or project-specific tasks. For example, Solidity is a new language that is specifically designed for smart contracts. Labeling Solidity code requires considerable domain knowledge of Blockchain, which is often costly and laborious. Also, the collected data is significantly redundant (ChenLZ0Z21). This restricts the collection of supervised data and leads to poor representations learned by the model (codenet).
Second, the pre-training tasks (e.g., masked language modeling (MLM)) are usually heterogeneous with downstream tasks such as code search. As such, the reusability of prior knowledge learned in the pre-training phase may be limited in the fine-tuning phase. This is even more challenging when there is no or insufficient training data for downstream tasks in domain-specific languages. Large PLMs can easily overfit scarce data, which leads to poor task fitting. Hence, an efficient mechanism to elicit knowledge from PLMs for downstream tasks in the zero-shot scenario is highly desirable.
Our work. In this paper, we propose Zecoler (Zero-shot code representation learning), a novel approach for learning code representations of a language that has no labelled data samples. The key idea is to transfer the representations of a programming language with sufficient data (i.e., a source language such as Java) to a target language that has few training samples (e.g., Solidity). Specifically, we adopt prompt-based learning, a new learning paradigm for PLMs: we continually train a PLM on the source language and then transfer the model to tasks in the target language.
First, by accompanying the PLM input with trainable prompt tokens, Zecoler adapts the downstream task to the same form as that used in pre-training. For example, code clone detection can be converted into an MLM task by inserting prompt and "[MASK]" tokens into the input. Then, to optimize the prompts, we learn a continuous task-specific vector on the downstream-task dataset in the source language. In this way, the model can be adapted to the target language, and the prompts further guide it to efficiently elicit the knowledge of programming languages learned during pre-training.
To evaluate the proposed approach, we experiment on five classification and generative tasks, including code clone detection, code search, method name prediction, code summarization, and code generation. The results show that our approach is substantially effective in zero-shot learning of code representations. The accuracy of the three classification tasks in Solidity is 79.8%, 67.1%, and 68.1%, respectively, which is around 14.7% greater than the strong baseline CodeBERT. The two generative tasks also show visible improvements in terms of BLEU/CodeBLEU and ROUGE-L.
This paper extends our preliminary study, which appeared as a research paper at ICPC (ZeroShot). In particular, we extend our preliminary work in the following directions:
1. We apply zero-shot learning of code representations to code generative tasks that can be modeled by the sequence-to-sequence framework.
2. We provide a more in-depth empirical study to investigate the effectiveness of our approach in code summarization and code generation tasks.
The main contributions of this paper, as a super-set of our preliminary study, are summarized as follows:
• To the best of our knowledge, we are the first to propose zero-shot learning of code representations, which does not require task-relevant training data of the target language for fine-tuning.
• We propose a prompt-based learning method for zero-shot code representation that can generalize to both classification and generative tasks.
• We conduct extensive experiments to evaluate the proposed approach on five code intelligence tasks. Results show that our approach significantly outperforms the baseline models.
The rest of the paper is organized as follows. Section 2 provides background knowledge on pre-trained models and zero-shot learning. Section 3 presents the technical details of Zecoler. The experimental setup is explained in Section 4 and the results are analyzed in Section 5. We discuss the threats to the validity in Section 6 and compare Zecoler against related work in Section 7. Finally, we conclude our study and mention future work in Section 8.
2 Background
2.1 Pre-trained Language Models for Code
Pre-trained language models (PLMs) such as BERT (DevlinCLT19), GPT, and T5 (RaffelSRLNMZLL20) have been shown to provide large improvements for a range of natural language tasks. The key idea of PLMs is to train a large model on vast corpora and use the resulting representations on tasks for which only limited amounts of labeled data are available.
A PLM is pre-trained on large-scale text corpora through a series of self-supervised learning tasks, e.g., masked language modeling (MLM) and next sentence prediction (NSP). The MLM task masks a random portion of tokens in the input text and tries to predict the masked words, while the NSP task predicts whether or not two given input segments are coherent. These pre-training tasks enable PLMs to learn general knowledge from large language corpora. The PLM is then fine-tuned on a task-specific dataset: a fine-tuning header on top of the PLM is optimized via supervised learning tasks in a specific domain. This header is usually a trainable neural network such as an MLP for classification tasks or a Transformer decoder for generative tasks.
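As a quick illustration of the MLM objective (not part of Zecoler itself), the following sketch queries a generic masked-language model for the most likely fillers of a masked slot; the checkpoint name and example sentence are illustrative.

```python
from transformers import pipeline

# A generic masked-language model (illustrative checkpoint, not the one used later).
fill_mask = pipeline("fill-mask", model="roberta-base")

# The model returns a distribution over the vocabulary for the <mask> slot.
for pred in fill_mask("Binary search runs in <mask> time.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```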
Given the great success of PLMs in NLP, researchers have also sought to adapt PLMs to programming languages (codebert; codet5; codex). They customize the pre-training objectives using programming-related tasks on a large code corpus. PLMs have been successfully used to learn code representations and have further been adapted to a variety of code intelligence tasks, e.g., clone detection (FangLS0S20), code summarization (ChoiBNL21), and code search (HaldarWXH20).
For example, CodeBERT is built on RoBERTa (roberta) and is pre-trained with both natural and programming languages. Figure 1 illustrates the main pipeline of CodeBERT. The model is first pre-trained on two tasks, namely, MLM and replaced token detection (RTD). The MLM task randomly masks tokens in natural language and programming language (NL-PL) pairs and trains the model to predict the original words. The RTD task trains the model to detect whether a given token is original or generated by the model. After pre-training, the model is fine-tuned on data of downstream tasks such as clone detection and code search. A fine-tuning header is added to the PLM and is optimized with the downstream tasks.
2.2 Zero-Shot Learning
Standard supervised learning approaches train a model with large-scale labelled samples. However, in many tasks such as recognizing the name of a new brand or translating a new language, obtaining sufficient training samples is laborious and often impractical. Zero-shot learning transfers a learned model from a source domain to a target domain that has no labelled data, and hence alleviates this "data hungry" problem. It can be realized through a variety of techniques such as data augmentation (BorneaPRFS21), meta-learning (metalearning1; metalearning2), PLMs (DevlinCLT19), and prompt-based learning (gpt3).
Data augmentation. A direct technique of zero-shot learning is data augmentation, namely, enlarging the data set (e.g., randomly inserting samples and noise) so that the model can have sufficient data samples for training (BorneaPRFS21).
Meta-learning. Another popular strategy for zero-shot learning is meta-learning. Meta-learning is also known as "learning to learn", which aims at training a meta learner that learns the update rules of the target model (GuWCLC18). This enables a machine learning model to achieve competitive performance even with scarce data. However, meta-learning focuses on learning strategies instead of representations. Hence, it is difficult to generalize across different code intelligence tasks.
Pre-trained Language Models. PLMs are pre-trained on large-scale text corpora to learn common knowledge of the languages, and can be generalized to specific tasks with only a few training examples. However, PLMs need a fine-tuning phase which continually trains the pre-trained model on the downstream tasks. Fine-tuning requires the availability of manually labelled datasets, which is laborious and expensive.
Prompt-based Learning. To alleviate the data hungry problem of fine-tuning, GPT-3 (gpt3) introduces prompt-based learning, a lightweight alternative to fine-tuning for PLMs. A prompt is usually a piece of text inserted into the input to guide the pre-trained model to generate desirable results. For example, one can prepend "TL;DR" followed by a few examples to the input of GPT-3 to let it summarize the input.
Unlike fine-tuning, which adds a fine-tuning header and re-optimizes the PLM on downstream tasks, prompt-based learning converts downstream tasks (e.g., method name prediction) into the same form as the pre-training tasks (e.g., MLM) by injecting "prompts" and "[MASK]" tokens into the PLM input. Hence, the PLM can generate the desired results with minimal adjustment. This encourages downstream tasks to reuse the knowledge from the PLM more efficiently.
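To make this concrete, the snippet below rewrites a clone-detection input into an MLM-style cloze query with a hand-written (discrete) prompt. The prompt wording and separators are purely illustrative; Zecoler instead learns continuous prompt vectors, as described in Section 3.

```python
code_a = "function transfer(address to, uint amount) public { ... }"
code_b = "function sendTokens(address dst, uint value) public { ... }"

# A textual prompt turns the classification task into a cloze question. The PLM is
# asked to fill "<mask>" with "yes" or "no", which a verbalizer later maps to the
# labels "cloned" / "not cloned". "</s>" is the RoBERTa-style separator token.
prompted_input = (
    f"{code_a} </s> {code_b} </s> "
    "The two snippets implement the same functionality: <mask> ."
)
```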
3 Approach
3.1 Problem Definition and Analysis
Definition 1 (Code Representation Learning). Code representation learning aims at representing source code as vectors. Let $c=\{t_1,\ldots,t_n\}$ denote a code snippet with $n$ tokens. A function $f_{\theta}$ is learned to map $c$ into a $d$-dimensional vector $\mathbf{v}$ which contains the semantics of $c$ (BengioCV13), namely,

$\mathbf{v} = f_{\theta}(c)$   (1)

$f_{\theta}$ is a function parameterized by $\theta$, which can be implemented using deep neural networks such as fully-connected networks (code2vec), LSTM (MahtoVTH21), and Transformers (transformer).
The learned program vectors can further be taken as input to machine learning models for code intelligence tasks, such as code clone detection, code search, and method name prediction.
Definition 2 (Code Classification Task). Given two code or text fragments $x_1$ and $x_2$, a code classification task aims to predict a category $y$ that represents their relationship:

$y = \Gamma(f_{\theta}(x_1, x_2))$   (2)

where $\Gamma$ denotes the neural classification model. Most code understanding tasks such as code clone detection, code search, and method name prediction can be formulated as a code classification task in a unified way. For example, in the code clone detection task, $x_1$ and $x_2$ stand for two code snippets and $y$ stands for whether they are cloned. In the code search task, $x_1$ and $x_2$ stand for a code snippet and a natural language description, respectively, and $y$ stands for whether they are semantically correlated.
State-of-the-art code representation learning techniques usually leverage the pre-training and fine-tuning paradigm: a Transformer is firstly pre-trained on large unlabeled code corpora using self-supervised objectives and is subsequently fine-tuned on labelled data of code classification tasks.
Definition 3 (Zero-Shot Code Representation Learning). The goal of zero-shot code representation learning is to generate semantic representations of an unseen programming language (target language) without requiring task-specific data. This can be achieved by reusing semantic representations of a seen programming language (source language). Let $\mathcal{L}_s$ be a source language and $\mathcal{L}_t$ be the target language. Zero-shot code representation learning aims to transfer the parameters from $f_{\theta_s}$ to $f_{\theta_t}$, where the former is trained on the task-specific training data of the source language.
$f_{\theta_s}$ cannot be used for the representation of $\mathcal{L}_t$ directly, since there is a lexical gap between $\mathcal{L}_s$ and $\mathcal{L}_t$. However, PLMs for both the source and the target languages are pre-trained on large-scale unlabelled code corpora with self-supervised objectives such as MLM. Hence, it is feasible to bridge their representation gap by using the common knowledge learned in pre-training. Based on this idea, we cast the problem in Equation 2 as a pre-training task of the PLM and train the model on the task-specific data of $\mathcal{L}_s$, so that the PLM can also predict reasonable results for samples in $\mathcal{L}_t$ seamlessly.
3.2 Model Architecture
Figure 2 illustrates the overall architecture of Zecoler. The pipeline comprises three steps:
1) We first cast the downstream task to the pre-training task (e.g., MLM) by inserting trainable prompts and a "[MASK]" token into the input of the task. The added prompts play the role of guiding the PLM to elicit the knowledge learned in pre-training and predict the right answer (§3.3).
2) Taking the resulting data as input, a PLM is then continually trained on the source-language dataset of the downstream task, that is, it infers the code representation of the input and predicts the word for the "[MASK]" token. The optimal prompts are searched for in the word embedding space automatically (§3.4).
3) Finally, we take an unlabelled dataset in the target language as the input to the PLM and let the model predict the answer without training. A verbalizer is employed to cast the predicted word to class labels (§3.5).
3.3 Casting Downstream Tasks
Our first step is to cast the downstream task (Equation 2) into the MLM task. We concatenate the two input snippets $x_1$ and $x_2$ of the downstream task, which can better capture the relationship between them (nspbert). As in the MLM task, we also insert a "[MASK]" token into the concatenated input. The "[MASK]" token acts as a placeholder which steers the pre-trained model to generate the classification result of the code intelligence task. It is notable that the position of the "[MASK]" token is a hyperparameter, and we append it at the tail of the input by default. The masked sequence

$\tilde{x} = [x_1; \text{[SEP]}; x_2; \text{[MASK]}]$   (3)

is taken as input to the PLM, which yields the hidden states

$\mathbf{h} = (h_1, \ldots, h_{|\tilde{x}|}) = \text{PLM}(\tilde{x})$   (4)

for all tokens. Then, the hidden state corresponding to the masked token, namely $h_{\text{[MASK]}}$, is fed into an MLM header which predicts a token for the masked position:

$\hat{y} = \text{softmax}(W h_{\text{[MASK]}} + b)$   (5)

The MLM header is a fully connected neural network parameterized by $(W, b)$ that is optimized to minimize the cross-entropy loss:

$\mathcal{L} = -\sum_{i=1}^{|V|} \mathbb{1}[i = y] \log \hat{y}_i$   (6)

where $y$ denotes the ground-truth label of the code intelligence task and $|V|$ is the vocabulary size.
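A minimal sketch of this casting with the Hugging Face API, assuming the public microsoft/codebert-base-mlm checkpoint (which ships with an MLM header); the helper name, the answer word, and the toy snippets are illustrative, not taken from the released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
plm = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

def mlm_loss(code_a: str, code_b: str, label_word: str) -> torch.Tensor:
    # Concatenate the two inputs and append a mask token at the tail (Eq. 3).
    # (A real implementation should truncate the code, not the mask token.)
    text = f"{code_a} {tok.sep_token} {code_b} {tok.mask_token}"
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)

    logits = plm(**enc).logits                          # (1, seq_len, |V|), Eq. 4-5
    mask_pos = (enc["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    mask_logits = logits[0, mask_pos]                   # prediction for [MASK]

    # Cross-entropy against the token that encodes the ground-truth label (Eq. 6).
    target_id = tok(" " + label_word, add_special_tokens=False)["input_ids"][0]
    return torch.nn.functional.cross_entropy(
        mask_logits.unsqueeze(0), torch.tensor([target_id]))

loss = mlm_loss("int add(int a, int b) { return a + b; }",
                "int sum(int x, int y) { return x + y; }",
                label_word="yes")
```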
3.4 Prompt-based Learning
The conventional fine-tuning method optimizes both the classification header $\Gamma$ and the representation function $f_{\theta}$ in Equation 2 from scratch. This causes the model to overfit scarce task-specific data. Inspired by prompt-based learning (ptuning), we optimize the PLM by merely adjusting its input sequence. More specifically, we insert a number of pseudo tokens called prompts into the input sequence of the PLM, which coax the PLM to directly generate the predicted class label of the downstream task. By only adjusting the model input, the PLM needs far less optimization cost to fit the data of the target task, while keeping most of the prior knowledge learned during pre-training.
Based on this idea, we design a number of prompt tokens and inject them into the masked sequence using a pre-defined template $T$. Hence, the original inputs $x_1$ and $x_2$ are transformed into

$\tilde{x} = T(x_1, x_2) = [p_{1:i};\ x_1;\ p_{i+1:j};\ x_2;\ p_{j+1:m};\ \text{[MASK]}]$   (7)

through the template $T$, which contains $m$ prompt tokens $p_1, \ldots, p_m$.
Like general words, these prompt tokens are embedded into trainable vectors and are continually trained on downstream tasks of the source domain through gradient descent.
However, the number of trainable parameters for the prompts is small compared with that of the original PLM. This may cause the prompt representation vectors to fall into local minima during gradient descent. To mitigate this problem, we additionally encode the prompts with a bidirectional LSTM.
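The sketch below shows one way such trainable prompts could be realized: pseudo-token embeddings are re-encoded by a bidirectional LSTM and an MLP, and the resulting vectors are spliced into the PLM's input embeddings. The module sizes and the prepend-only splicing position are illustrative simplifications; Section 4.4 describes the configuration actually used.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Maps m trainable pseudo tokens to prompt vectors via a bi-LSTM and an MLP."""

    def __init__(self, num_prompts: int = 10, hidden: int = 768):
        super().__init__()
        self.prompt_embeds = nn.Embedding(num_prompts, hidden)
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self) -> torch.Tensor:                   # -> (num_prompts, hidden)
        ids = torch.arange(self.prompt_embeds.num_embeddings)
        out, _ = self.lstm(self.prompt_embeds(ids).unsqueeze(0))
        return self.mlp(out).squeeze(0)

def build_inputs_embeds(plm, input_ids, prompt_encoder):
    """Splice prompt vectors in front of the token embeddings and feed the PLM
    via `inputs_embeds` (the attention mask must be extended accordingly)."""
    tok_embeds = plm.get_input_embeddings()(input_ids)              # (B, L, H)
    prompts = prompt_encoder().unsqueeze(0).expand(input_ids.size(0), -1, -1)
    return torch.cat([prompts, tok_embeds], dim=1)                  # (B, m+L, H)
```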
In the zero-shot setting, there is no training sample in the target low-resource language. Instead, we continually train the PLM and the trainable prompts using large-scale code corpora in popular languages (e.g., Java), and then directly apply the trained model to tasks in the low-resource language. More specifically, we train the PLM in the source domain through prompt-based learning. Then, we feed data samples from the target domain into the same model without extra training and obtain the results of the downstream tasks.
3.5 Reverting Outputs to Final Answer
The MLM task generates a token that is likely to fill the masked position. In order to obtain the classification result, we need to revert the MLM predictions to classification labels of the downstream task. For this purpose, we employ a verbalizer (pet) which realizes such a reversion. Let $V$ be the vocabulary of the PLM and $Y$ be the label set of the downstream task, such as {true, false}. The verbalizer is defined as a function $v: V \rightarrow Y$ that maps each candidate word in the vocabulary to a classification label. The choice of candidate words is arbitrary as long as they are sufficiently different; the model will be trained to map candidate words to true predictions. In our approach, we consider two candidate words as a candidate set and only inspect which word in this set is more likely to fill the "[MASK]" position according to the PLM predictions. If the word "yes" has a higher probability of filling the masked position, the verbalizer maps it to the label "true" and hence outputs a positive prediction for this task.
Take code clone detection as an example. Given two code snippets, the model constructs an input sequence by injecting a number of prompt tokens into the snippets, followed by a "[MASK]" token. The constructed sequence is fed into the PLM to predict the label $y \in$ {"cloned", "not cloned"}. The MLM header of the PLM outputs the probability of each candidate word for the masked position. If the candidate word "yes" has a higher probability, the verbalizer will map it to the class label "cloned", yielding the final prediction "cloned".
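A minimal verbalizer for clone detection could simply compare the PLM's scores for two candidate answer words at the masked position, as in the sketch below; the candidate words, checkpoint, and function name are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
plm = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# Verbalizer: candidate answer word -> class label (illustrative word choice).
VERBALIZER = {" yes": "cloned", " no": "not cloned"}
cand_ids = [tok(w, add_special_tokens=False)["input_ids"][0] for w in VERBALIZER]
cand_labels = list(VERBALIZER.values())

def classify(prompted_text: str) -> str:
    enc = tok(prompted_text, return_tensors="pt", truncation=True, max_length=512)
    mask_pos = (enc["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    logits = plm(**enc).logits[0, mask_pos]
    # Only the two candidate words compete for the masked position.
    return cand_labels[int(torch.argmax(logits[cand_ids]))]
```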
3.6 Zecoler for Generative Tasks
Besides classification tasks, we also explore zero-shot learning on generative tasks such as program generation and code summarization.
Definition 4 (Code Generative Task). A code generative task aims to generate a target sequence $y = (y_1, \ldots, y_T)$ given an input code snippet or description $x$:

$p(y \mid x) = \prod_{t=1}^{T} g_{\psi}\big(y_t \mid f_{\theta}(x), y_{<t}\big)$   (8)

where $p(y \mid x)$ denotes the probability of $y$, $f_{\theta}$ denotes the representation learner, and $g_{\psi}$ denotes a neural generative decoder. For instance, the code summarization task generates a brief summary $y$ for the given code $x$.
Figure 3 illustrates the architecture of Zecoler for generative tasks. The model follows a conventional encoder-decoder architecture. The pipeline is comprised of two steps: First, the PLM encoder takes as input a code snippet or a description $x$, prepends it with a prompt $p$, and encodes the result as vectors. Then, the decoder generates the target sequence based on the encoded hidden vectors:

$y_t = g_{\psi}\big(f_{\theta}([p; x]), y_{<t}\big)$   (9)
The prompt is designed as a sequence of prefix tokens, which helps the PLM extract the knowledge learned in the pre-training phase. In the code generation task, in order to specify the target language and guide the model to generate the correct program, we append a special token "language" to the original prompt, where "language" indicates the target language name.
We select CodeBERT as the encoder and a randomly-initialized Transformer as the decoder. Both the PLM and the prompt are trained on the task-specific dataset in the source language and are then transferred to the generative task in the target language without extra training.
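A rough sketch of this encoder-decoder setup: CodeBERT encodes the prompted input, and a randomly initialized Transformer decoder attends to the encoder states to generate the target sequence. The prompt wording, layer sizes, and reuse of the input embeddings are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

# Randomly initialized 6-layer Transformer decoder used as the generation header.
dec_layer = nn.TransformerDecoderLayer(d_model=768, nhead=12, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
lm_head = nn.Linear(768, tok.vocab_size)

def decode_step(source: str, target_ids: torch.Tensor) -> torch.Tensor:
    # Prepend a prompt naming the target language (illustrative wording).
    enc = tok("generate go: " + source, return_tensors="pt",
              truncation=True, max_length=512)
    memory = encoder(**enc).last_hidden_state                   # (1, L, 768)
    tgt = encoder.get_input_embeddings()(target_ids)            # reuse PLM embeddings
    causal = nn.Transformer.generate_square_subsequent_mask(target_ids.size(1))
    hidden = decoder(tgt, memory, tgt_mask=causal)
    return lm_head(hidden)                                      # next-token logits
```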
3.7 Training and Usage
Figure 4 shows the workflow of Zecoler. Zecoler follows the general paradigm of learning code representations. In the training phase, Zecoler is given a training set of labelled code snippets. For each snippet (pair), Zecoler augments it using a prompt template. The prompt-augmented code (pair) is taken as input to Zecoler which yields the prediction and calculates the loss function based on the ground-truth data.
In the usage phase, Zecoler is given a code snippet (pair) only. Zecoler augments it using the same prompt template as in the training phase and then gives the prediction for the downstream task.
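Putting the two phases together, a training/usage skeleton might look as follows; `build_prompted_loss` and `verbalize` stand for the casting and verbalizer steps sketched earlier and are placeholders, not functions from the released code.

```python
import torch

def train_on_source(plm, prompt_encoder, source_loader, build_prompted_loss,
                    epochs: int = 20, lr: float = 3e-5):
    """Prompt tuning on the labelled source-language task data (training phase)."""
    params = list(plm.parameters()) + list(prompt_encoder.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for code_a, code_b, label_word in source_loader:
            loss = build_prompted_loss(plm, prompt_encoder, code_a, code_b, label_word)
            loss.backward()
            opt.step()
            opt.zero_grad()

# Usage phase: the same prompt template is applied to the target language,
# with no further training, e.g.:
#   label = verbalize(plm, prompt_encoder, solidity_code_a, solidity_code_b)
```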
4 Experimental Setup
4.1 Research Questions
We evaluate Zecoler by answering the following research questions:
• RQ1: How effective is our approach in zero-shot code representation learning?
We evaluate the effectiveness of Zecoler in zero-shot code representation learning. We take Java as the source language and transfer the learned model to Solidity, a domain-specific language, and Go, an up-and-coming language, neither of which is provided with training samples. The experiments are conducted on three popular classification tasks.
• RQ2: How effective is our approach in few-shot code representation learning?
For some programming languages and tasks, we can obtain scarce (e.g., 100) samples in the target language. We wonder whether Zecoler is also effective with such data when pre-trained on the source domain. Therefore, we provide the pre-trained model in RQ1 with a few samples of the target language and conduct the same experiments as in RQ1.
• RQ3: How effective is our approach in monolingual code representation learning?
RQ1 and RQ2 mainly evaluate the effectiveness of Zecoler in a cross-language setting. We further explore how effective our approach is without transfer learning. We train the model in three languages in the few-shot setting and test it in the same language. Besides Solidity, we also want to assess our approach in other languages such as Java and Go when training data is insufficient.
• RQ4: How effective is our approach in zero-shot generative tasks?
Besides the classification tasks, we also investigate the effectiveness of Zecoler in generative tasks. Similar to RQ1, we take Java as the source language and Solidity and Go as the target languages. We also take JavaScript and Ruby into account as examples of programming languages used in specific projects. The experiments are conducted on the code summarization and code generation tasks.
• RQ5: How do different hyperparameters impact the performance of our approach?
We evaluate the performance of our approach under different hyperparameters. Specifically, we conduct ablation studies on prompt templates (number and position), source languages, and PLM scales.
4.2 Downstream Tasks
We evaluate our approach on three classification tasks and two generative tasks:
1) Code Clone Detection (CD): a task that determines whether two code snippets are cloned or not (bigclonebench). A PLM-based clone detection model takes as input two code snippets and outputs their representations. Then, a classification header is built on top of the representations and predicts whether the two code snippets are cloned (=1) or not (=0). There are four types of clones (clonedetection). Our approach targets the challenging type-3 and type-4 clones, where the two snippets are not textually identical but implement the same functionality.
2) Code Search (CS): a task that retrieves a semantically relevant code snippet for a given natural language query (codesearchnet). Following CodeBERT (codebert), we formulate code search as a classification problem. Given a natural language description and a programming language code snippet, this task aims at determining whether this NL-PL pair is related. The binary answer is "related" or "not related". The classification generates a probability score, which can be used for ranking the results of code search.
3) Method Name Prediction (MNP): a task that suggests the function name for a code snippet (ZhangCLP21). Similar to code search, we transform this task into a binary classification task (ComptonFPK20): given a code snippet, it enumerates all candidate function names (i.e., the vocabulary of code tokens) and constructs a (snippet, name) pair. The pair is taken as input to the PLM, which outputs a binary prediction of whether the name in the pair is "suitable" (=1) or "not suitable" (=0) for the code snippet.
4) Code Summarization (CM): a task that generates a natural language summary for given source code (00030W0Z22). We fine-tune the PLM using a parallel dataset of PL-NL pairs.
5) Code Generation (CG): a task that automatically generates source code for a natural language query (WangWWMLZLWJL22). Code generation is a challenging task because programs must follow syntax rules while PLMs are good at generating sequential tokens. We fine-tune the PLM on a parallel set of NL-PL pairs.
4.3 Datasets
Datasets | Downstream Tasks * | Size | Programming Languages
SCCD | CD | 10,000 | Solidity
SCS | CS, MNP, CM | 347,410 | Solidity
CodeNet | CD, CS | 8,008,527 | Java, Go, C++, C, Python, Ruby, C#, …
CodeSearchNet | CM, CG | 2,000,000 | Java, Go, Python, JavaScript, Ruby, PHP

* CD = clone detection, CS = code search, MNP = method name prediction, CM = code summarization, CG = code generation.
We conduct our experiments on four datasets. Each dataset may be used for multiple downstream tasks. Table 1 shows the statistics of each dataset, including sizes, programming languages, and corresponding tasks.
Smart Contract Clone Detection (SCCD): a manually labeled clone detection dataset for the Solidity language. The dataset contains 10,000 data samples that are collected from EtherScan (https://etherscan.io/), an analytic platform for smart contracts. We build a web scraper to collect Solidity code and label the cloned pairs based on contract information such as contract address and opcode. Each data sample consists of a pair of code snippets that are cloned. One notable feature of this dataset is that most samples are type-3 and type-4 clones.
Smart Contract Summarization (SCS) (YangKYGWMZ21): a dataset that contains 347,410 code-comment pairs in the Solidity language. The dataset was originally collected for code summarization, and we preprocess it to fit the code search and method name prediction tasks. To adapt it to the code search task, we filter out long code and remove code comments. For method name prediction, we separate method names from the original code snippets.
CodeNet (codenet): a multilingual codebase built from two online judge websites, namely, AIZU (https://onlinejudge.u-aizu.ac.jp/introduction) and AtCoder (https://atcoder.jp/). CodeNet contains 8,008,527 code submissions in multiple programming languages such as Java, Go, Ruby, and Python. We use this dataset for the code clone detection and code search tasks. To adapt the dataset to the code clone detection task, we label two code submissions as a cloned pair if they solve the same problem. To adapt the original data to the code search task, we extract (NL, PL) pairs from problem descriptions and their code submissions, respectively.
CodeSearchNet (codesearchnet): a widely used dataset for NL-PL and PL-NL tasks. The dataset involves 2,000,000 code snippets in six languages, namely, Java, Go, Python, JavaScript, Ruby and PHP. Each snippet is accompanied with a corresponding natural language description and the method name.
We preprocess these datasets by removing comments from the code, since they can interfere with the final results in the classification task experiments. Code snippets with more than 250 tokens are filtered out to fit the input length of the PLMs. We also exclude code snippets with fewer than 125 tokens to accommodate the downstream tasks. In order to prevent the model from being biased towards one class in classification tasks, we balance the dataset with the same number (1:1) of positive and negative samples. The negative pairs are created by randomly combining snippets from the positive data samples.
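A simplified sketch of this preprocessing; the whitespace-based token count and the pairing logic are rough approximations of what is described above, not the exact released scripts.

```python
import random

def build_balanced_dataset(pairs):
    """pairs: list of (code_a, code_b) positive (e.g., cloned) samples."""
    def length_ok(code: str) -> bool:
        n = len(code.split())          # rough token count (illustrative)
        return 125 <= n <= 250

    positives = [(a, b, 1) for a, b in pairs if length_ok(a) and length_ok(b)]

    # Balance 1:1 by randomly combining snippets drawn from different positive pairs.
    negatives = []
    while len(negatives) < len(positives):
        (a, _), (_, b) = random.sample(pairs, 2)
        negatives.append((a, b, 0))
    return positives + negatives
```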
4.4 Implementation Details
We implement our models on top of CodeBERT, a popular PLM which is built based on RoBERTa-base (H=768, A=12, L=12). CodeBERT learns representations of programming languages (Java, Python, JavaScript, PHP, Ruby, and Go) in the pre-training phase. We use the default tokenizer (i.e., microsoft/codebert-base) of CodeBERT with a vocabulary size of 50,265. We set the maximum sequence length to 512. Our experimental implementation is based on the Huggingface Transformers (https://huggingface.co/microsoft/codebert-base) and P-Tuning (ptuning). The batch size and the number of epochs are set to 10 and 20 in classification tasks, and 20 and 15 in generative tasks.
In classification tasks, we insert prompt tokens uniformly into the original input of CodeBERT since there are two text snippets in the input. The additional LSTM for training prompts has two hidden layers followed by two-layer multilayer perceptrons (MLP) activated by ReLU.
To generate target sequences, we employ a 6-layer Transformer decoder as the fine-tuning header. Since we append a special prompt "language" to the original input and change the input pattern, we continually pre-train CodeBERT with the MLM task on 100,000 unlabelled code snippets sampled from CodeSearchNet, each appended with a language mark. The batch size and number of epochs are set to 8 and 3, respectively.
All models are optimized using the AdamW (adamw) algorithm on a machine with a GeForce RTX 3090 Ti GPU. The initial learning rate (lr) is set to 3e-5 and increases linearly from 0 during a warm-up period. The number of warm-up iterations equals the number of training steps in the first epoch. During the rest of the training process, the learning rate continuously decreases to 0. We measure the performance on the validation set during training. The checkpoint that achieves the best accuracy on the validation set is selected for testing.
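The optimization schedule described above (linear warm-up over the first epoch, then linear decay to zero) could be configured as in this sketch using standard library utilities.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, train_loader, epochs: int = 20, lr: float = 3e-5):
    steps_per_epoch = len(train_loader)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # Warm up for exactly one epoch, then decay linearly to 0 over the rest.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=steps_per_epoch,
        num_training_steps=steps_per_epoch * epochs,
    )
    return optimizer, scheduler   # call scheduler.step() after each optimizer step
```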
4.5 Baseline Models
We compare our approach with five baseline models:
1) AVG: a baseline approach that directly represents programs by averaging their token embeddings. We reuse token embeddings from CodeBERT and represent an input code snippet by the average of all its token embeddings. Next, we fine-tune the classifier of downstream tasks using a 3-layer MLP header.
2) RoBERTa (roberta): a popular pre-trained language model that has also been used for programming languages (codebert). The model is constructed with 12 Transformer layers and pre-trained on a large English corpus with the MLM objective. We fine-tune it with a 3-layer MLP header over the "[CLS]" position.
3) RoBERTa-large (https://huggingface.co/roberta-large): a large version of RoBERTa (H=1024, A=16, L=24) with around 300 million parameters. We compare with this model to verify the advantages of Zecoler over large-scale PLMs.
4) CodeBERTa (https://huggingface.co/huggingface/CodeBERTa-small-v1): a version of RoBERTa pre-trained on CodeSearchNet, proposed by Huggingface. We use its default setting in our experiments.
5) CodeBERT (codebert): one of the state-of-the-art models for learning code representations. A more detailed description of CodeBERT can be found in Section 2.1. We follow the same experimental setup as in its original paper.
We implement these baseline models by referring to the work of CodeXGLUE (codexglue). In classification tasks, we construct a 3-layer fully connected neural network as the fine-tuning header, which maps the hidden vector of the "[CLS]" token to the class labels of downstream tasks. In generative tasks, we use a 6-layer Transformer decoder as the fine-tuning header for generating the target sequence of downstream tasks.
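For reference, the baselines' classification header (a 3-layer MLP over the "[CLS]" hidden state) might look like the following sketch; the layer widths are assumptions.

```python
import torch.nn as nn

class ClsHeader(nn.Module):
    """3-layer MLP header over the [CLS] hidden state for binary classification."""

    def __init__(self, hidden: int = 768, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, last_hidden_state):
        return self.mlp(last_hidden_state[:, 0])   # [CLS] is the first token
```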
5 Results
5.1 RQ1: Effectiveness of Zero-shot Learning
In this experiment, we evaluate the effectiveness of Zecoler in zero-shot code representations learning. We initially train a representation model for each task using data samples of Java. Then, we adapt the trained model to the target languages (i.e., Solidity and Go) directly without extra training. We train the model with both 5,000 and 500 data samples of Java to assess the effects under different data sizes.
Model | Number of Layers | CD | CS | MNP | |||
Solidity | Go | Solidity | Go | Solidity | Go | ||
AVG | - | 57.5 | 49.2 | 49.0 | 50.3 | 50.8 | 50.0 |
RoBERTa | 12 | 60.5 | 49.4 | 49.6 | 49.4 | 50.0 | 50.3 |
RoBERTa-L | 24 | 47.3 | 51.0 | 48.7 | 48.8 | 51.7 | 48.5 |
CodeBERTa | 6 | 57.9 | 67.3 | 53.2 | 53.1 | 49.7 | 49.0 |
CodeBERT | 12 | 65.4 | 91.7 | 48.9 | 46.2 | 52.1 | 65.2 |
Zecoler 5000 | 12 | 79.8 | 96.4 | 67.1 | 80.3 | 59.2 | 98.8 |
Zecoler 500 | 12 | 74.9 | 82.4 | 53.3 | 56.9 | 68.1 | 90.4 |
* The target languages (i.e., Solidity and Go) are not provided with training data. All models are trained with 5,000 source-language samples, except the last row (Zecoler 500), which is trained with only 500 samples.
Table 2 shows the accuracy of different models in three classification tasks. We can observe that Zecoler significantly outperforms baseline models in all three tasks and all target languages. In the code clone detection task, the accuracy of Zecoler is 5%-14% greater than that of CodeBERT, the strongest baseline. The improvement is much more significant in the code search (30% in average) and method name prediction (24% in average) tasks. By contrast, AVG and RoBERTa-large obtain results that are close to random, indicating that they can hardly learn useful knowledge from few data samples.
The same trend can be observed when only 500 (1/10) samples of the source language are provided for training. As the data size decreases from 5,000 to 500, the accuracy of Zecoler drops in all tasks. Nevertheless, it still significantly outperforms the baseline models. This means that Zecoler learns representations much more efficiently while requiring less data than the baselines.
Another interesting observation is that Zecoler trained with 500 data samples outperforms that with 5,000 data samples in the method name prediction task for Solidity. One potential reason is that the model overfits the big Java data in the training process, which results in inferior performance in the target low-resource languages such as Solidity.
It is notable that CodeBERT outperforms RoBERTa and CodeBERTa in both code clone detection and method name prediction, but not in the code search task. We conjecture that this is because CodeBERT is pre-trained on programming languages whereas RoBERTa is only pre-trained on natural languages. Hence, CodeBERT can be better adapted to PL-related tasks.
Answer to RQ1: Our approach shows greater performance than baseline models in zero-resource code classification tasks, affirming the strong ability of Zecoler in zero-shot code representation learning.
5.2 RQ2: Effectiveness of Few-Shot Learning
In this experiment, we evaluate the effectiveness of Zecoler in few-shot learning of code representations. This simulates the scenario where there are a few training data samples in the target language. We continue training the model in RQ1 using a few data samples of the target languages. We vary the data sizes from 32 to 700 in Solidity and Go and evaluate the performance in three classification tasks.
[Figure 5: Results of few-shot learning on the three classification tasks in Solidity and Go with varying target-language data sizes.]
Figure 5 shows the results in the classification tasks. Compared to CodeBERT, the strongest baseline in RQ1, Zecoler demonstrates greater strengths in all tasks when provided with a few data samples of the target language. When the data size is 700, the accuracy of Zecoler is about 30% greater than that of CodeBERT. This means that Zecoler outperforms existing approaches in learning code representations even when only a small number of samples is given. In RQ1, the model has already learned knowledge of the downstream tasks in the training phase, so it is easier to achieve good performance by continually training on the target domain instead of training from scratch.
We notice that when the data size is extremely small (e.g., 32), the model tends to overfit the data. In this situation, zero-shot learning is preferable.
Answer to RQ2: Zecoler demonstrates effective performance when a few data samples of the target domain are given. As in the zero-shot setting, Zecoler still achieves the best performance among all compared models in the few-shot setting.
5.3 RQ3: Effectiveness of Monolingual Few-Shot Learning
Different from RQ2 in a cross-language few-shot setting, in this experiment we evaluate the effectiveness of our approach in a monolingual few-shot setting. We train the models with a few samples of Java, Solidity and Go, and evaluate the performance of the tasks in the same language.
Model | CD | CS | MNP | Average | ||||||
Java | Solidity | Go | Java | Solidity | Go | Java | Solidity | Go | ||
AVG 300 | 51.6 | 64.1 | 50.1 | 49.1 | 50.2 | 50.1 | 47.9 | 48.1 | 50.1 | 51.3 |
RoBERTa 300 | 46.6 | 73.5 | 50.1 | 52.5 | 55.2 | 50.1 | 50.1 | 53.7 | 50.1 | 53.5 |
RoBERTa-large 300 | 53.0 | 75.9 | 53.9 | 50.4 | 57.4 | 51.4 | 47.8 | 55.6 | 50.1 | 55.1 |
CodeBERTa 300 | 50.7 | 68.3 | 65.3 | 52.0 | 58.8 | 45.0 | 50.8 | 61.8 | 47.4 | 55.6 |
CodeBERT 300 | 51.3 | 69.4 | 49.5 | 50.8 | 56.5 | 49.5 | 49.3 | 53.9 | 49.5 | 53.3 |
Zecoler 300 | 85.8 | 94.3 | 99.3 | 51.7 | 90.1 | 99.5 | 98.7 | 88.8 | 99.2 | 89.7 |
Zecoler 100 | 63.6 | 93.9 | 99.5 | 52.6 | 63.0 | 95.7 | 72.8 | 62.8 | 77.7 | 75.7 |
* In this experiment, we train and test the model in the same programming language. The training data size is 300 samples, except the last row (Zecoler 100), which uses only 100 data samples.
Table 3 shows the accuracy of different approaches in the three classification tasks. We can observe that Zecoler consistently outperforms the baselines in the monolingual few-shot setting. Most of the baseline models simply predict random answers, with an accuracy of around 50%. This indicates that the baseline models cannot learn meaningful code representations with scarce data. Comparatively, Zecoler achieves 75.7% accuracy on average with only 100 data samples. The results suggest that Zecoler learns code representations efficiently in the monolingual few-shot setting.
[Figure 6: Accuracy of Zecoler, CodeBERT, and CodeBERTa under different monolingual training data sizes on the classification tasks.]
Figure 6 shows the performance of Zecoler, CodeBERT, and CodeBERTa with different data sizes in the classification task. We can see that Zecoler outperforms the other two baselines under almost all data sizes. Furthermore, as the data size increases, the accuracy of Zecoler grows faster than that of baseline models. This indicates that Zecoler is effective in learning code representations given only a few data samples.
We have also observed that monolingual learning outperforms cross-language learning on small data sizes (e.g., 32 and 100), but achieves similar performance when the data size becomes larger. This is because continuously training on scarce data of a different language can lead to overfitting.
Answer to RQ3: Zecoler is effective in monolingual few-shot learning, and demonstrates much stronger performance than that in the cross-language setting.
5.4 RQ4: Effectiveness of Zero-shot Learning in Generative Tasks
In this experiment, we evaluate the effectiveness of Zecoler in zero-shot generative tasks. Similar to the experiment in RQ1, we initially train a representation model for each task using 5,000 data samples of Java, and then apply the trained model to the target languages (i.e., Solidity and Go) without extra training.
Model | Solidity | Go | JavaScript | Ruby | ||||
BLEU | ROUGE-L | BLEU | ROUGE-L | BLEU | ROUGE-L | BLEU | ROUGE-L | |
RoBERTa | 11.35 | 19.12 | 7.96 | 15.55 | 6.78 | 11.79 | 7.71 | 14.24 |
RoBERTa-L | 12.52 | 20.30 | 8.43 | 16.61 | 8.01 | 13.84 | 9.07 | 16.10 |
CodeBERTa | 12.01 | 17.69 | 8.74 | 16.77 | 7.75 | 12.05 | 8.90 | 15.84 |
CodeBERT | 12.68 | 19.39 | 8.21 | 17.64 | 8.60 | 15.40 | 9.45 | 17.05 |
Zecoler | 13.37 | 20.67 | 8.67 | 17.73 | 9.54 | 16.88 | 9.97 | 17.59 |
Model | Solidity | Go | JavaScript | Ruby | ||||
BLEU | ROUGE-L | CodeBLEU | ROUGE-L | CodeBLEU | ROUGE-L | CodeBLEU | ROUGE-L | |
RoBERTa | 2.69 | 19.09 | 7.88 | 17.68 | 7.55 | 13.71 | 9.05 | 19.43 |
RoBERTa-L | 4.01 | 20.29 | 9.55 | 18.09 | 8.51 | 14.87 | 10.02 | 20.69 |
CodeBERTa | 3.29 | 20.12 | 7.70 | 18.42 | 7.11 | 13.61 | 9.14 | 20.47 |
CodeBERT | 2.74 | 19.29 | 8.68 | 18.13 | 8.36 | 15.01 | 9.67 | 20.51 |
Zecoler | 3.20 | 20.78 | 9.18 | 18.78 | 8.52 | 15.26 | 10.05 | 21.26 |
Tables 4 and 5 show the accuracy of the various models in the code summarization and code generation tasks, respectively. We measure the accuracy of code generation using CodeBLEU (codebleu) instead of BLEU, except for Solidity, which is not supported by CodeBLEU. We can observe that Zecoler consistently outperforms baseline models across most target languages in the two tasks. Compared with CodeBERT, Zecoler gains 6.68% and 4.84% improvements in code summarization and 4.35% and 4.33% improvements in code generation in terms of BLEU and ROUGE-L, respectively.
We also notice that RoBERTa-large and CodeBERTa surpass our approach in the code summarization task for Go. The main reason may be that these models have a larger model size, which will be further discussed in RQ5.
Answer to RQ4: Our approach consistently outperforms existing approaches in zero-shot generative tasks, indicating good generalizability of Zecoler.
5.5 RQ5: Ablation Study
In this experiment, we inspect the performance of Zecoler under different hyperparameters. We vary the prompt templates and the number of prompt tokens to search for the optimal prompt template. We also explore the impact of different source languages and different scales of backbone PLMs on the performance.
Prompt Templates: We first explore the effect of prompt templates on the performance. We vary the position of the prompt tokens in the template, namely, head: $[p_{1:m}, x_1, x_2, \text{[MASK]}]$, middle: $[x_1, p_{1:m}, x_2, \text{[MASK]}]$, uniform: $[p_{1:i}, x_1, p_{i+1:j}, x_2, p_{j+1:m}, \text{[MASK]}]$, and tail: $[x_1, x_2, \text{[MASK]}, p_{1:m}]$. The number of prompt tokens ($m$) is fixed to 10. We train the model with 700 Java code snippets and evaluate the model on the code clone detection task of Solidity.
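The four placements can be expressed as simple layouts over the prompt tokens and the two inputs, as in this sketch; the exact token layout is inferred from the template names and is illustrative.

```python
def build_template(position, x1, x2, prompts, mask="[MASK]"):
    """Return the token layout for one of the four prompt placements."""
    m = len(prompts)
    if position == "head":
        return prompts + x1 + x2 + [mask]
    if position == "middle":
        return x1 + prompts + x2 + [mask]
    if position == "tail":
        return x1 + x2 + [mask] + prompts
    # "uniform": spread the prompt tokens before, between, and after the inputs.
    third = m // 3
    return (prompts[:third] + x1 + prompts[third:2 * third]
            + x2 + prompts[2 * third:] + [mask])
```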
As shown in Figure 7(a), placing prompt tokens uniformly achieves the best performance compared to the other templates. The reason could be that prompts have more influence on nearby tokens. By placing prompts uniformly, every input token can be influenced by sufficient prompts.
Number of Prompt Tokens: We further assess the impact of prompt numbers. We insert prompt tokens uniformly into the PLM input and vary the number of prompt tokens from 1 to 20. We train the model with 700 Java code snippets and evaluate it on the code clone detection task of Solidity.
[Figure 7: (a) Accuracy with different prompt token positions; (b) accuracy with different numbers of prompt tokens.]
As Figure 7(b) shows, the number of prompt tokens is strongly correlated with the performance of representation learning. Too few prompt tokens are insufficient to steer the PLM to yield meaningful predictions, while large numbers of prompts restrict the input size. The optimal number of prompt tokens is 10 in our experiments. A similar test has been made in the generative tasks, and the optimal number of prompt tokens is also 10.
Source Languages: To study the impact of different source languages, we train the model using 5,000 data samples of Java, Python, and C++, respectively. We evaluate the performance of zero-shot code representation on the code clone detection task for nine target languages in CodeNet. Figure 8 shows the results. We can observe that using Java as the source language achieves the best performance. This can be attributed to two reasons. First, CodeBERT is pre-trained using Java. Second, as a common language, Java contains more general features of programming languages compared to other languages. This facilitates the transfer of the model to other languages with zero- or few-shot samples.
Pre-trained model size: Lastly, we study the effect of model size. In classification tasks (Table 2 and 3), larger PLMs have a negative effect on the performance. For example, RoBERTa-large is almost twice the size of RoBERTa, but the former performs even worse than the latter. On the contrary, in generative tasks (Table 4 and 5), larger PLMs perform better than those in classification tasks.
This discrepancy could be caused by different model structures between classification and generative tasks. In classification tasks, the models need to encode two separate text inputs and combine them to the fine-tuning header, which makes it difficult for large PLMs to converge in the transfer learning process. However, in generative tasks, PLMs in different sizes take a single text snippet as input and decode the final answer. Hence, larger PLMs, with more common knowledge learned in pre-training phase, have more potential to enhance the downstream tasks.
Answer to RQ5: The effectiveness of our approach is affected by prompt templates, source languages, and PLM scales. Inserting ten prompt tokens uniformly to the original PLM input can better steer the PLM to learn code representations. Java as the source language can be better generalized to other languages. In the zero- or few-shot setting, PLM scale does not always make a positive impact, which is closely related to the type of task.
5.6 Qualitative Analysis
To demonstrate the effectiveness of our approach, we qualitatively analyze concrete examples of five code intelligence tasks by Zecoler and CodeBERT.
Figure 9(a) shows the results of clone detection for two code snippets in Solidity. They are similar in functionality while differing in keywords and structures. For example, "paymentAddress" and "kashout" (highlighted in red) are two equivalent keywords in the two snippets. Because the two words are both domain-specific, baseline models such as CodeBERT can hardly detect the clone without prior knowledge. Comparatively, Zecoler successfully detects the clone by reusing prior knowledge from the PLM via prompt learning.
Figure 9(b) shows an example of code search. The query searches for programs that dial with a time-out limit. The target code involves a Go specific API “context.WithTimeout” which can hardly be recognized by CodeBERT. Thanks to the zero-shot learning of Zecoler, our approach successfully recognizes both the domain-specific API and the query, hence gives a correct answer.
Figure 9(c) compares the results by Zecoler and CodeBERT in the method name prediction task of Go. The “panic” in the code is a language-specific API which interrupts the program when an error occurs. As can be seen, CodeBERT can hardly comprehend the function “panic” in zero-shot scenario, while Zecoler can fully understand it and predict the correct method name.
Figure 9(d) presents an example of code summarization. Compared with the summary generated by CodeBERT, the one generated by Zecoler contains more precise keywords such as "Special". Also, the meaning of "new" in its output is similar to that of "initializes" in the expected one. In contrast, CodeBERT is not capable of grasping the meaning of the code snippet, and thus the generated summary is far from the expected output.
Figure 9(e) compares the generated code in Go for the query "CompareBS is a barrier session comparator based on seqno". Zecoler generates code with correct syntax, especially the return and if statements, while many repeated erroneous tokens such as "Sql." can be found in the code generated by CodeBERT.
These examples demonstrate the superiority of Zecoler in zero-shot code representation learning. The prompts in Zecoler cast the underlying meaning in the PLM to downstream tasks, which help large PLMs capture text and code semantics even without training data.
6 Threats to Validity
Internal Validity. Our approach is built upon CodeBERT. Although CodeBERT is the most popular PLM for learning code representations, other PLMs for code such as CodeT5 (with an encoder-decoder architecture) and Codex (unidirectional Transformers) may yield different results. However, we argue that our approach is independent of the PLM architecture itself, since we merely modify the format of the input and output of the PLM.
Prompts have a small number of trainable parameters and may lead to unstable results. Although we alleviate this problem by employing an LSTM, we still observe that the prompt representations fall into local minima during gradient descent. We have to train the model repeatedly with different random seeds to obtain the best automatically-generated prompts. We leave the dynamic optimization of prompts for future work.
External Validity. In our work, the classification downstream tasks are assumed to be binary classifications. Hence, we represent the binary answers using two candidate words; extending to more candidate words for multi-class classification tasks remains to be investigated. The candidate words are manually selected, and a search could be performed to find the most suitable ones. We could further represent the candidate words as trainable vectors, just like the prompts in our approach.
7 Related Work
7.1 Learning Code Representations
As the core prerequisite of many code intelligence tasks, learning code representations has been extensively explored in software engineering (codexglue). Broadly, typical approaches to learning code representations can be classified into three categories: unsupervised learning for general languages, supervised learning for specific tasks, and few-shot learning.
The most typical category of work lies in unsupervised approaches such as code2vec (code2vec), code2seq (code2seq), and InferCode (infercode). Code2vec and code2seq aggregate representations of each path in the AST (abstract syntax tree) based on attention. InferCode predicts subtrees automatically identified from the contexts of an AST in a self-supervised manner. These methods directly learn code representations from AST paths. They utilize the word embedding techniques of natural language processing and incorporate them with the semantic and syntactic information in program source code. The limitation of these methods is the lack of adaptation to downstream tasks. The learned code vectors are fixed and cannot be fine-tuned on downstream tasks. Furthermore, these methods are purely trained on code and are thus unsuitable for NL-PL tasks such as code search.
To improve the performance of downstream tasks, researchers have also resorted to task-oriented supervised learning methods (codenet). For example, for the code clone detection task, FangLS0S20 captured the similarity of semantics between two code snippets using a supervised deep learning model, which pays attention to caller-callee relationships and learns the hidden syntactic and semantic features of source code. ZhangHZWLS21 disentangled the representation of semantics and syntax with AST and GAN (generative adversarial network), then used only the semantic representation to detect code clones. For the code search task, gu2018deepcs proposed a code representation model named CODEnn to learn semantic representations of code snippets through joint embedding with comments, HaldarWXH20 designed a multi-perspective cross-lingual neural framework, and Liw20 learned code-query interactions. ZhangCLP21 proposed a hybrid code representation learning approach to resolve program dependence and semantics for predicting method names. YangCZSZ21 learned a unified vector representation of both methods and bug reports for method-level fault localization. ZhouLSD019 constructed a graph neural network to learn semantic representations of code to identify vulnerable functions. WangL21a proposed an AST Graph Attention Block to capture different dependencies in the AST graph for representation learning in code completion. These models are trained for specific downstream tasks, which achieves good performance but lacks the generality to support multiple tasks with one single model.
The aforementioned methods require a large scale corpus to train the code representation model. To alleviate this problem, pre-trained programming language models are proposed such as CodeBERT (codebert) and CodeT5 (codet5). It is a fine-tuning based few-shot program learning paradigm: PLMs learn a vast amount of knowledge from large scale unlabelled corpora in the pre-training phase, and achieve state-of-the-art accuracy in the fine-tuning phase with a small amount of labelled task-specific data. This gives PLMs the basic generalization ability to handle a wide range of downstream tasks well. Task adaption through fine-tuning adds extra knowledge of specific tasks to PLMs and improves the performance. However, in this paradigm, the gap between the pre-training phase and the downstream task can be significant: the objectives are different, and for the downstream tasks, we usually need to introduce new parameters.
To the best of our knowledge, our Zecoler is the first zero-shot learning method for code representation. Zecoler follows a prompt-based learning paradigm for task adaption. Prompt learning makes it possible for downstream tasks to take the same format as the pre-training objectives and require no new parameters. By narrowing the gap between the two phases, deploying the PLMs on specific tasks becomes much easier with little training data.
7.2 Prompt-based Learning
As a promising method for zero-shot learning, a growing number of prompt-based learning approaches (abs210713586) have been proposed in recent years. For example, pet proposed PET, which transforms the classification task into an MLM task and uses prompts to elicit knowledge from the PLM; however, the prompt is manually crafted, and it is hard to select the most suitable words for it. autoprompt proposed AutoPrompt, which automatically searches for prompt words discretely using gradient signals in the target task. Although discrete searching retains the semantics of the prompt, it still cannot find the most precise prompts for the model. To address this problem, instead of using prompts to solve downstream tasks directly, abs-2210-14803 proposed a method that uses prompts to filter the training dataset and train the model more efficiently; however, the quality of the dataset is hard to control, which limits the accuracy of downstream tasks. ArcoVK22 tackled this challenge with combinations of multiple prompts for a more robust prompt template, but this approach relies on human selection and still cannot guarantee finding the best prompt. prefixtuning proposed Prefix-Tuning, which optimizes a continuous task-specific vector prepended to every layer of the Transformer in the PLM and freezes the PLM to save computation cost. Prefix-Tuning demonstrates superb performance, but it only focuses on natural language generative tasks. nomorefinetune empirically evaluated the usage and effect of prompt tuning in code intelligence tasks including defect prediction, code summarization, and code translation. The results show that prompt tuning outperforms fine-tuning with full data and also shows great potential in low-resource scenarios.
Comparatively, Zecoler optimizes the prompt vectors in continuous space instead of discrete words or human-writing, making the prompt more suitable for PLMs to understand and more efficient for extracting knowledge. Moreover, Zecoler is the first prompt method for zero-shot code representation that can generalize to various code understanding and generative tasks.
8 Conclusion
In this paper, we propose Zecoler, a novel approach for zero-shot code representation learning via prompt tuning. Zecoler improves traditional pre-trained programming language models by introducing prompt into code representation learning. Experiments show that Zecoler outperforms baseline models in both code understanding and generative tasks under zero-shot settings. Code representations learned by Zecoler also demonstrate good generalizability for low-resource programming languages.
In the future, we will investigate our approach in more software engineering tasks, with other pre-trained models such as CodeT5 and Codex. We will also consider more characteristics of source code, such as syntactic structures, in the design of the prompt and verbalizer.
Data Availability
Our source code and experimental data are publicly available at: https://github.com/ChrisCN97/zecoler/tree/emse.
Acknowledgments
This research is supported by National Natural Science Foundation of China (Grant No. 62232003, 62102244), CCF-Tencent Open Research Fund (RAGR20220129).
Conflict of Interest
The authors declared that they have no conflict of interest.