
On The Cross-Modal Transfer from Natural Language to Code through Adapter Modules

Divyam Goel Indian Institute of TechnologyRoorkeeIndia [email protected] Ramansh Grover Delhi Technological UniversityDelhiIndia ramanshgrover˙[email protected]  and  Fatemeh H. Fard University of British ColumbiaCanada [email protected]
(2022)
Abstract.

Pre-trained neural Language Models (PTLM), such as CodeBERT, are recently used in software engineering as models pre-trained on large source code corpora. Their knowledge is transferred to downstream tasks (e.g., code clone detection) via fine-tuning. In natural language processing (NLP), adapters, compact and parameter-efficient modules inserted into the layers of a PTLM, have been explored as an alternative for transferring the knowledge of PTLMs. Although adapters are known to facilitate adaptation to many downstream tasks more easily than fine-tuning, which requires retraining all of the model's parameters, owing to their plug-and-play nature and parameter efficiency, their usage in software engineering has not been explored.

Here, we explore knowledge transfer using adapters, building on the Naturalness Hypothesis proposed by Hindle et al. (Hindle et al., 2016). We study the bimodality of adapters on two tasks, cloze test and code clone detection, and compare against their benchmarks from the CodeXGLUE platform. These adapters are trained on programming languages and inserted into a PTLM that is pre-trained on English corpora (N-PTLM). Three programming languages, C/C++, Python, and Java, are studied, along with extensive experiments on the best setup for the adapters. Improving the results of the N-PTLM confirms the success of the adapters in transferring knowledge to software engineering; the results are sometimes on par with or exceed those of a PTLM trained on source code, while being more efficient in terms of the number of parameters, memory usage, and inference time. Our results can open new directions to build smaller models for more software engineering tasks. We open source all the scripts and the trained adapters.

Pre-trained Language Models, Transfer learning, Adapters, Parameter Efficient Models
journalyear: 2022; copyright: acmcopyright; conference: 30th International Conference on Program Comprehension, May 16–17, 2022, Virtual Event, USA; booktitle: 30th International Conference on Program Comprehension (ICPC ’22), May 16–17, 2022, Virtual Event, USA; price: 15.00; ccs: Software and its engineering, Software maintenance tools; ccs: Computing methodologies, Neural networks

1. Introduction

Refer to caption
Figure 1. The parameter budget of adapters and C-PTLM for code clone detection.

Deep Pre-Trained Language Models (PTLM) such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) provide powerful, general-purpose linguistic representations that have empowered significant advances in various Natural Language Processing (NLP) tasks such as text classification and language understanding (Liu et al., 2019). The PTLMs employ large unlabeled Natural Language (NL) datasets with self-supervised learning objectives such as Masked Language Modeling (MLM) and Next Sentence Prediction (Devlin et al., 2018), and are then fine-tuned on downstream tasks. In software engineering, recent efforts apply such approaches, pre-training models on source code; we refer to these models as C-PTLMs. CodeBERT (Feng et al., 2020) and CuBERT (Kanade et al., 2020) are two such C-PTLMs developed to obtain linguistic representations for source code. CodeBERT is a multilingual pre-trained model which is bimodal, i.e., trained on NL and programming language (PL); CuBERT uses a dataset of PL to train a BERT model. Note that NL and PL are considered different modalities (Feng et al., 2020). These models are fine-tuned on several software engineering downstream tasks such as code clone detection and code search (Feng et al., 2020; Kanade et al., 2020).

Fine-tuning large PTLMs is the most common approach to knowledge transfer from existing models to downstream tasks. When the model is fine-tuned, all of the learned weights of the model are trained again on labelled data. Although this approach achieves state-of-the-art performance on many NLP (Liu et al., 2019; Lan et al., 2020) and software engineering tasks (Feng et al., 2020; Kanade et al., 2020; Wang et al., 2021), it is computationally expensive: for each task of interest, all parameters of the model must be fine-tuned, leading to several large models for the desired tasks. Additionally, for each task, the users should save the entire model (Pfeiffer et al., 2021a), leading to inefficient memory usage. Consequently, it is imperative to explore more compact alternatives to knowledge transfer to overcome these caveats.

In NLP, adapter modules for Transformers (Houlsby et al., 2019) provide a parameter-efficient, compact, and extensible approach to knowledge transfer among tasks or languages. Adapters have been proposed recently as an alternative to fine-tuning for domain and task transfer, transfer learning, cross-lingual transfer, and transferring to unseen languages. In this sense, adapters share the parameters of the PTLM for all tasks/languages while introducing a small set of task/language-specific parameters in the intermediate layers of the PTLM. In this way, adapters encapsulate domain knowledge in a parameter- and memory-efficient manner. By using adapter modules, only a tiny set of weights is trained instead of fine-tuning the entirety of the model. A number of adapter architectures, ranging from serial to parallel (Zhu et al., 2021), language-specific transformations (Bapna and Firat, 2019; Artetxe et al., 2020; Philip et al., 2020; Zhu et al., 2021), and task-specific transformations (Pfeiffer et al., 2021a, 2020b), have been proposed. Other studies focus on using multiple adapter modules to disentangle different elements of knowledge relevant to the target domain of the downstream task (Pfeiffer et al., 2021a) and on invertible adapter architectures for effectively adapting a multilingual model to a new language (Pfeiffer et al., 2020b). Even though the results of models with adapters are promising in NLP, the capability of adapters is not explored for software engineering, nor are they extended to other language modalities, particularly programming languages for software engineering tasks.

In addition, despite the known similarity of programming languages to natural languages (Allamanis et al., 2018; Hindle et al., 2016), recent efforts focus on introducing new pre-trained models on source code with various objectives; studies on transferring the knowledge from natural language to programming languages are limited. In this paper, we explore adapters for programming languages. The main objective of this research is to study to what extent adapter modules can be used to transfer the representations of natural language (English) to programming languages. This is done by a cross-modal model, which we refer to as MODE-X, that uses adapters as its main modules: programming language-specific adapters are trained and inserted inside the layers of RoBERTa (Liu et al., 2019), which is pre-trained on a large English corpus. We evaluate the models on two tasks, cloze test and code clone detection. These tasks exist on CodeXGLUE (Lu et al., 2021), the General Language Understanding Evaluation benchmark for CODE (https://github.com/microsoft/CodeXGLUE), which evaluates neural models that are trained for source code. We compare the results of our models with the results obtained by fine-tuning RoBERTa and CodeBERT, including parameter and memory usage comparisons. We run several experiments to study which layers of adapters have the most impact and how adapters perform when tested on unseen programming languages. Figure 1 shows the parameter efficiency of MODE-X when tested on three datasets for code clone detection. This plot shows that MODE-X is 60-140 times more parameter efficient while achieving comparable performance for the C/C++ and Java code clone detection datasets.

Note that the main objective here is not to present a new model, but to explore adapters in software engineering and for source code.

Significance: The results of our study can impact software engineering practitioners and researchers from different perspectives: i) Fine-tuning PTLMs for different tasks and using deep neural networks are computationally expensive, and not everyone has access to such powerful GPU processing units. Adapters, on the other hand, are plug-and-play modules that can be inserted in any PTLM, and they can be trained on free cloud services such as Google Colab. ii) Although we only study them for one N-PTLM, adapters can be used in other PTLMs and C-PTLMs and are not bound to a specific PTLM. iii) Adapters are small, parameter- and memory-efficient modules that enable scaling up large PTLMs to many tasks and languages, without the significant drop in in-domain performance associated with the “curse of multilinguality” (Conneau et al., 2020). The curse of multilinguality relates to PTLMs that are trained on multiple languages and is the trade-off between language coverage and model capacity: the limited capacity of the PTLMs leads to a drop in the performance of a multilingual model when more languages are added, compared to its monolingual variants. iv) A parameter-efficient model results in faster inference. Also, due to the lower memory overhead, we can use the same device for a higher number of tasks. This also enables us to integrate the models into Integrated Development Environments (IDE), which require the model to be small. v) We open source our trained adapters, which can be used in different studies and for various software engineering tasks. The results of our work can open new avenues of research for transferring the learned knowledge from natural language to programming languages and in software engineering, in addition to developing models that are more computationally efficient.

Contribution: This is the first work that applies adapter modules to software engineering. We also study the bimodality of adapters, adapting natural languages to programming languages for the first time. Thus, all experiments and obtained results are among the novelties of our work. We open source our scripts, a document including all the detailed results, and the SCD-88 dataset. We also open source the trained adapters (see Replication Package).

The rest of this paper is organized as follows. In Section 2 we provide details about adapters, which is followed by the design of our study, experimental setup, results, and discussions in Sections 3–6. Threats to validity are discussed in Section 7. We overview the related works in Section 8 and conclude the paper in Section 9.

2. Background

2.1. Transformers and PTLMs

The Transformer is a state-of-the-art neural network architecture that has achieved the best results in many NL tasks (Vaswani et al., 2017). A Transformer is a stack of encoders and decoders, each considered as a layer. It uses an attention mechanism through the multi-head self-attention sub-layer, which is followed by a feed-forward sub-layer. The multi-head self-attention helps the model encode each word by attending to other words in the input sequence. Each of these sub-layers in each encoder has a residual connection, and layer normalization is applied after each one (i.e., after the multi-head self-attention and the feed-forward network). Bidirectional Encoder Representations from Transformers (BERT) is the predecessor of the N-PTLM used in our study (Devlin et al., 2018). BERT enables fine-tuning the model for downstream tasks with one additional output layer. After BERT, many Transformer-based PTLMs were introduced, e.g., RoBERTa (Liu et al., 2019), which is the main architecture for many C-PTLMs.
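To make the layer structure concrete, the following is a minimal sketch of one encoder layer in PyTorch, using RoBERTa-base-like dimensions as an assumption; it illustrates the sub-layer ordering described above rather than the exact Hugging Face implementation.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Minimal Transformer encoder layer: multi-head self-attention and a
    feed-forward sub-layer, each with a residual connection and layer norm."""
    def __init__(self, hidden=768, heads=12, ffn_dim=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden))
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, x):                      # x: (batch, seq_len, hidden)
        attn_out, _ = self.attn(x, x, x)       # attend to all tokens in the sequence
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))        # feed-forward, residual + layer norm
        return x
```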

2.2. Adapters

Adapters are small bottleneck layers that are inserted into a PTLM (mainly a multilingual PTLM) and enable adapting the PTLM to a new language (Pfeiffer et al., 2020b). Adapters leverage a small number of parameters to adapt the PTLM. They are trained as language-specific adapter modules (L-adapters) or task-specific adapter modules (T-adapters). The former are trained via masked language modeling on unlabelled data of a target language of interest, and the latter optimize a target task on labelled data. This training allows the PTLM to be adapted to unseen languages that are not covered in the PTLM. The framework for adapters that we use in our study is based on Multiple Adapters for Cross-lingual transfer (MAD-X) (Pfeiffer et al., 2020b), which builds on an architecture that allows sharing information between multiple tasks (Pfeiffer et al., 2021a). MAD-X enables adaptation to languages unseen by the PTLM “without learning expensive language-specific token-level embeddings”, as the adapters are trained while keeping the parameters of the PTLM fixed (i.e., frozen). The overall architecture of the adapters is shown in Figure 2. The language and task adapter modules are inserted after the feed-forward network, in each layer of the Transformer-based PTLM. The T-adapters are stacked on the L-adapters when they are needed for the downstream task. The language adapter $LA_{l}$ at layer $l$ of the Transformer is defined as

Refer to caption
Figure 2. Language, task, and invertible adapters in the MAD-X framework.
(1)  LA_{l}(h_{l}, r_{l}) = U_{l}(ReLU(D_{l}(h_{l}))) + r_{l}

where $D \in \mathbb{R}^{h \times d}$ is the down-projection, $h$ is the hidden size of the Transformer model, and $d$ is the dimension of the adapter. $ReLU$ is the activation function and $U \in \mathbb{R}^{d \times h}$ is the up-projection at every layer $l$. $h_{l}$ (output of the subsequent layer normalization) and $r_{l}$ (output of the feed-forward layer) are the hidden state and residual at layer $l$ of the Transformer, respectively. During training of the T-adapters, which are trained using labelled data, the parameters of the L-adapter of the corresponding language and of the Transformer are frozen. The task adapter $TA_{l}$ at layer $l$ of the Transformer model is similar to $LA_{l}$ and is calculated as below:

(2)  TA_{l}(h_{l}, r_{l}) = U_{l}(ReLU(D_{l}(LA_{l}))) + r_{l}
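A minimal PyTorch sketch of the bottleneck adapter defined by Equations (1) and (2); the dimensions are illustrative defaults, and the actual MAD-X implementation in the adapter-transformers library differs in details.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Adapter of Eq. (1)/(2): output = U(ReLU(D(h))) + r."""
    def __init__(self, hidden=768, bottleneck=48):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)   # down-projection D (h x d)
        self.up = nn.Linear(bottleneck, hidden)     # up-projection U (d x h)
        self.act = nn.ReLU()

    def forward(self, h, residual):
        return self.up(self.act(self.down(h))) + residual

# Stacking at layer l (Eq. 2): the T-adapter consumes the L-adapter's output.
# l_adapter, t_adapter = BottleneckAdapter(), BottleneckAdapter()
# out = t_adapter(l_adapter(h_l, r_l), r_l)
```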

Invertible Adapters. The invertible adapters are proposed in (Pfeiffer et al., 2020b) to deal with the mismatch between the vocabularies of the multilingual PTLM and the new unseen or low-resource language. They are inserted on top of the input embedding layer, and their inverses before the output embedding layer, as shown in the left part of Figure 2. Note that each language should have an invertible adapter in this framework. The function of invertible adapters is similar to that of language adapters, with the aim of capturing language-specific transformations at the token level. They are trained with the language adapters using MLM on unlabelled data. This invertibility enables efficient utilization of the “parameter budget”, allowing the same set of parameters to adapt both input and output representations. Invertibility also becomes crucial to ensure that the model does not overfit on the pre-training objective, i.e., MLM, when the model is fine-tuned on a task. For fine-tuning the model for a specific task, we remove the output embedding layer and its corresponding inversion layer, after which we freeze the parameters of the L-adapters along with the PTLM's parameters.
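To illustrate the idea of invertibility, the sketch below shows a generic additive coupling layer whose inverse reuses the same parameters exactly; it is a simplified stand-in for the MAD-X invertible adapter, not its actual implementation.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Illustrative invertible block: split the embedding into two halves and
    transform one half conditioned on the other; the inverse is exact."""
    def __init__(self, dim=768, hidden=384):
        super().__init__()
        self.m = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim // 2))

    def forward(self, x):                          # applied after the input embeddings
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.m(x1)], dim=-1)

    def inverse(self, y):                          # applied before the output embeddings
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.m(y1)], dim=-1)
```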

3. Study Design

Refer to caption
Figure 3. Steps of our experiments.

In this section, we explain the design of our study to answer the following research questions. We design our study to use RoBERTa, which is the base model of CodeBERT. However, as the source code for CodeBERT is not available, we are unable to experiment with models that insert adapters in CodeBERT.

  • RQ1: How do adapters perform in representing the code when adapting N-PTLM to a target programming language?

  • RQ2: How well does MODE-X facilitate cross-modal transfer to downstream tasks compared to fully fine-tuning the PTLMs?

  • RQ3: How computationally efficient are the adapters compared to fully fine-tuning the PTLMs?

  • RQ4: Which layers have more impact on N-PTLM for adapting from natural language to programming language?

Figure 3 represents a diagram of the steps in our study. We first choose an N-PTLM as the base PTLM. We train three PL-specific adapters on unlabelled data, one for C/C++, one for Java, and one for Python. These three programming languages are chosen based on the availability of PTLMs using them and the availability of data for training and testing in the downstream task. The L-adapter is inserted in each layer of the N-PTLM. We use two different datasets, CodeNet (CN) (Puri et al., 2021) and CodeSearchNet (CSN) (Husain et al., 2020), for training the L-adapters separately, therefore evaluating two models for each PL. These are shown as L-adaptersCN and L-adaptersCSN.

We choose two of the tasks from code-code category of the CodeXGLUE benchmark (Lu et al., 2021) that is published by Microsoft. This benchmark is chosen as it is a collection of tasks, datasets and a platform for evaluating and comparing machine learning models for code understanding and generation. The two chosen tasks are Cloze Test (CT) and code clone detection, which are used to answer RQ1 and RQ2, respectively. We only choose tasks from code-code category as this can show the ability of the adapters to adapt the learned knowledge of the N-PTLM to downstream tasks that are only based on code. To answer RQ1, CT is chosen as it can evaluate the ability of the trained model in predicting the missing tokens (Feng et al., 2020). This task can show how the contextual knowledge of the N-PTLM is transferred using the adapters to the new modality, i.e., code, as CT is considered as a task to test the code semantics (Lu et al., 2021). Our model that has L-adapters inserted in the N-PTLM is tested for CT. The results of this model will be compared to the results of N-PTLM (RoBERTa) and C-PTLM (CodeBERT). The evaluation metric for cloze test is accuracy, which is explained in section 4.4.

For RQ2, code clone detection is used as it can evaluate how well the model performs in finding semantically similar fragments of code. For code clone detection, we use our model (L-adapters inserted in the N-PTLM) and stack the T-adapters on top of the L-adapters in all layers. Then, we add a custom head on top of the model complementary to the task at hand. As the parameters of the N-PTLM and L-adapters are frozen, the result can be thought of as another pre-trained language model in which the T-adapters are injected to adapt the model's parameters away from the pre-training MLM objective to a new objective for the downstream task. This model that has L-adapters and T-adapters is referred to as MODE-X in this work. The results will be evaluated against the N-PTLM and C-PTLM that are fine-tuned for clone detection. Three datasets are used for code clone detection in C/C++, Java, and Python, with evaluation metrics of F1 and MAP@R, which are explained in Section 4.4.
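A minimal sketch of the MODE-X fine-tuning setup as described above: the N-PTLM and L-adapter weights stay frozen, and only the T-adapters and the task-specific head receive gradients. The 'task_adapter' substring used to select parameters is an assumed naming convention for this illustration, not the one used by any particular library.

```python
def prepare_mode_x(model, task_head):
    """Freeze the N-PTLM and L-adapters; keep T-adapters and the task head trainable.
    Assumes T-adapter parameters carry 'task_adapter' in their names (illustrative)."""
    for name, param in model.named_parameters():
        param.requires_grad = "task_adapter" in name     # PTLM + L-adapters stay frozen
    for param in task_head.parameters():
        param.requires_grad = True                       # custom head is trained
    trainable = [p for p in list(model.parameters()) + list(task_head.parameters())
                 if p.requires_grad]
    return trainable                                     # pass this list to the optimizer
```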

The number of parameters of the models required for training and fine-tuning is recorded, as well as their memory usage, which will be reported in RQ3. As the L-adapters are inserted within each layer of the N-PTLM in our study, we are interested in understanding how the performance of the model changes as we add L-adapters incrementally to each layer. We answer this question in RQ4.

4. Experimental Setup

4.1. Baselines

RoBERTa (Robustly optimized BERT approach) (Liu et al., 2019) is based on BERT and modifies its pre-training steps, which yields substantially better performance on all the classification tasks. It uses an MLM objective and includes longer sequences. RoBERTa is used in previous software engineering studies (Zhang et al., 2020a) and is the base model for current C-PTLM models, including CodeBERT (Feng et al., 2020). RoBERTa is released in different model sizes, of which we use the 12-layer architecture known as RoBERTa-base.

CodeBERT is a BERT-style model that performs well on understanding problems related to source code (Feng et al., 2020). CodeBERT is one of the models used as a baseline on the CodeXGLUE platform. CodeBERT uses the same architecture as RoBERTa-base and is trained with two objectives, MLM and Replaced Token Detection (RTD) (Clark et al., 2020). There are two publicly available versions of the trained CodeBERT model: one that utilizes MLM as its training objective (CodeBERTMLM) and is trained on the code corpus of the CodeSearchNet dataset, and another that uses the MLM + RTD objectives (CodeBERTMLM+RTD) and is trained on the bimodal data (i.e., code and documents) of CodeSearchNet. CodeBERT is trained on a combination of 6 programming languages from CodeSearchNet. For the cloze test, we use CodeBERTMLM, as the cloze test requires the model to predict the masked token. CodeBERTMLM+RTD cannot perform the cloze test, as its final layers include a discriminator model. For the same reason, the CodeBERT authors only published the results of the MLM variant for the cloze test in their work (Feng et al., 2020). For code clone detection, we use both variants of the model and report the best results here.

These two models are chosen for comparison as they show the transferability of the adapters from natural language to programming language, RoBERTa being at one extreme and CodeBERT at the other. In addition, they both use the same architecture, which provides a fair comparison between the models, especially for parameter and memory efficiency. Other C-PTLMs are not chosen here, as they use a different architecture, are trained on a different dataset, or are not available as a benchmark.

4.2. Datasets and Tasks

Adapter Training Datasets: Two datasets are used to train the L-adapters, to evaluate whether differences in the datasets could make a difference in the ability of the adapters. (Interestingly, the size of the dataset used for training does not necessarily affect their capability. For example, although CodeSearchNet (CSN) has many more Java and Python samples to train adapters (Table 1), adapters trained using CodeNet (CN) achieve higher scores for CT-Max/Min (Table 3) and very close scores to the CSN adapters for code clone detection (Table 4).) The first dataset is CodeNet from IBM (Puri et al., 2021), a large-scale, high-quality dataset collected for studies of artificial intelligence for code. To train the L-adapters, we randomly split the data of each PL into a 90-10 split for the train and validation sets, respectively. The second dataset is CodeSearchNet (Husain et al., 2020), a joint effort from GitHub and Microsoft Research, which consists of code and comment pairs in 6 languages. CodeSearchNet has pre-determined splits for train, validation, and test sets that we use in our study. L-adapters are trained on the training set of each dataset separately and evaluated on the validation set of each dataset separately. The CN and CSN statistics are shown in Table 1.

Table 1. Statistics of CodeNet and CodeSearchNet datasets
Language Train # Validation # Total #
CodeNet (CN)
C/C++ 559,497 62,167 621,664
Python 216,000 24,000 240,000
Java 67,500 7,500 75,000
CodeSearchNet (CSN)
Python 412,178 23,107 435,285
Java 454,451 26,909 481,360

Cloze Test (CT) is a probing experiment designed by the authors of CodeBERT to evaluate their model's capability in learning the linguistic information of code without modifying the model's parameters (Feng et al., 2020). Here, given a code snippet, the task is posed as a multi-choice classification in which the model predicts the masked token of interest. The CT task used here has two setups on CodeXGLUE: CT-all and CT-Max/Min. In CT-all, the model should predict tokens in the source code, where the tokens come from the entire vocabulary. In CT-Max/Min, the tokens that should be predicted by the model come from the {max, min} set. CT-Max/Min evaluates the model's ability to understand code semantics (Lu et al., 2021). For testing a model on CT, no fine-tuning is required. Both the CT-all and CT-Max/Min datasets are combinations of the validation and test sets of the CodeSearchNet data. We choose the portions of the dataset that are in Python and Java, as our study is on these languages. The numbers of instances for CT-Max/Min and CT-all in Python are 1,264 and 40,137, respectively. In Java, these numbers are 482 and 40,492 instances for CT-Max/Min and CT-all, respectively. CSN does not include C/C++. We tried to build such a dataset ourselves, but we could not find a C/C++ dataset with vocabularies and thresholds similar to those of CT in CodeXGLUE.
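The CT-Max/Min setup can be illustrated as constrained masked-token prediction: the model scores only the candidate tokens at the masked position. The sketch below uses the Hugging Face masked-LM interface with plain RoBERTa as a stand-in; the <MASK> placeholder and the data handling of the actual CodeXGLUE pipeline are assumptions for this example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

def cloze_max_min(code_with_mask, candidates=("max", "min")):
    """Predict which candidate fills the <MASK> position in a code snippet."""
    text = code_with_mask.replace("<MASK>", tok.mask_token)
    inputs = tok(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    cand_ids = tok.convert_tokens_to_ids(list(candidates))   # restrict scoring to {max, min}
    return candidates[int(torch.argmax(logits[cand_ids]))]

# e.g. cloze_max_min("def largest(a, b): return <MASK>(a, b)")
```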

Code Clone Detection (CCD) aims to identify similar fragments of code within a codebase, where identifying semantically similar code is the target task for evaluating some C-PTLMs (Wang et al., 2021). We utilize POJ-104 and BigCloneBench (BCB), which are part of the Code-Code pipeline of CodeXGLUE (Lu et al., 2021). POJ-104 has C/C++ programs and aims to retrieve the top-k semantically similar codes; it is evaluated using the MAP@R score (see Section 4.4). BCB intends to discern whether a given pair of fragments is semantically equivalent or not; it is a binary classification problem and is evaluated using the F1 score (see Section 4.4). As there is no Python dataset on CodeXGLUE for code clone detection, we consider the Python-specific subset of the cross-language clone detection (XCD) dataset (Perez and Chiba, 2019). We refer to this as the SCD-88 dataset, where 88 is the number of problems, each with several submitted solutions in Python. As the task of interest here is similar to the one used for POJ-104, we reformulate it as a retrieval task and evaluate it using the MAP@R score. Table 2 shows the respective splits for POJ-104, BCB, and SCD-88.

Reasons: CT evaluates the linguistic knowledge of the models, which is important in tasks such as name prediction and code summarization. CCD is chosen as a practical software engineering problem. Other code-to-code tasks from CodeXGLUE would require the same language adapters but different task adapters. As the analysis remains the same and our goal is to study the feasibility of using adapters in SE, we focus on these two tasks.

Table 2. Statistics of code clone detection datasets
Dataset Train # Validation # Test #
BCB 901,028 415,416 415,416
POJ-104 32,000 8,000 12,000
SCD-88 7,800 1,040 2,600

4.3. Training Models

Training L-Adapters: We train the L-adapters using the invertible configuration (Pfeiffer et al., 2020b). L-adapters are trained on the code corpora of CN and CSN for each of the languages separately, leading to five L-adapters: Python-adapterCN, Python-adapterCSN, Java-adapterCN, Java-adapterCSN, and C/C++-adapterCN. The L-adapters are trained using the Adam optimizer and a learning rate of 1E-4.
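For reference, the sketch below shows how a language adapter with an invertible configuration can be added and isolated for training with the AdapterHub adapter-transformers library; the exact class and identifier names ("pfeiffer+inv", add_adapter, train_adapter) reflect one version of that API and should be treated as assumptions rather than our exact training script.

```python
# Sketch assuming the AdapterHub `adapter-transformers` fork of Hugging Face
# Transformers; method and config names may differ between library versions.
from transformers import RobertaForMaskedLM, AdapterConfig

model = RobertaForMaskedLM.from_pretrained("roberta-base")
config = AdapterConfig.load("pfeiffer+inv")           # bottleneck + invertible adapters
model.add_adapter("python_adapter", config=config)    # insert the L-adapter in every layer
model.train_adapter("python_adapter")                 # freeze RoBERTa, unfreeze only the adapter

# The adapter is then trained with the MLM objective (Adam, learning rate 1E-4 in our
# setup) on the CodeNet or CodeSearchNet code corpus of the target language.
```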

Training T-Adapters: The T-adapters are trained using the configuration introduced in (Pfeiffer et al., 2020a). We use in-batch negative sampling to train these adapters, keeping in line with the experimental setup described by the authors of CodeBERT (Feng et al., 2020). To prevent the adapters from overfitting, dropout and early stopping are used. The setup for the T-adapters is the same as for training the baselines.
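In-batch negative sampling for retrieval-style clone detection can be sketched as follows: within a mini-batch, each code embedding is pulled toward its paired positive and pushed away from every other example in the batch. This is an illustrative simplification, not the exact loss of the CodeXGLUE pipeline.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(anchor_emb, positive_emb):
    """anchor_emb, positive_emb: (batch, hidden) embeddings of paired similar code.
    Every other example in the batch acts as a negative for a given anchor."""
    sim = anchor_emb @ positive_emb.t()                    # (batch, batch) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)  # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```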

Training Baselines: To maintain consistency across our evaluations, we re-evaluated the existing benchmark performances of RoBERTa and CodeBERT for CT and clone detection in our study. We confirmed our obtained results with the authors of CodeBERT and found them acceptable; our results fall within a 2% error rate of what they reported. Keeping in line with the benchmark experiments of CodeXGLUE, we also utilize in-batch negative sampling. The choice of hyperparameters, learning rate schedules, and optimizers remains unchanged from CodeXGLUE's benchmarking experiments: although we tried different hyperparameters for fine-tuning the baselines, the best results were obtained with the recommended ones, which we use in our evaluations.

All experiments are conducted on an Nvidia Tesla V100 32GB GPU.

4.4. Evaluation Metrics

Accuracy is calculated as $\frac{TP+TN}{TP+TN+FP+FN}$. Here, TP is the number of records correctly identified as belonging to the positive class, FP the records incorrectly classified as belonging to the positive class, TN the records correctly predicted as negative examples, and FN the records incorrectly predicted as the negative class.

F1-Score (F1): F1 Score is the weighted average of Precision and Recall: $F1 = \frac{2\cdot(P\cdot R)}{P+R}$. Here, P stands for Precision, computed as $P = \frac{TP}{TP+FP}$, and R is Recall, which is calculated as $R = \frac{TP}{TP+FN}$.

Mean Average Precision at R (MAP@R) (Musgrave et al., 2020) is a more informative accuracy metric that does not have the weakness of the R-Precision metric; instead, it accounts for the ranking of the correct retrievals. In R-Precision, a score of $r/R$ is assigned to each query (e.g., a code sample for which we want to find similar code samples), where $r$ is the number of the R nearest samples to the query that are in the same class as the query, and R denotes the total number of references in the searchable dataset. MAP@R calculates the Mean Average Precision with the number of nearest neighbors for each sample set to R. For a single query it is defined as follows, where $P(i)$ is the precision at $i$ if the $i$th retrieval is correct and 0 otherwise:

MAP@R = \frac{1}{R}\sum_{i=1}^{R}{P(i)}
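A minimal implementation consistent with this definition, assuming embeddings as a NumPy array, integer labels, and R taken per query as the number of same-class references (following Musgrave et al.):

```python
import numpy as np

def map_at_r(embeddings, labels):
    """MAP@R over all queries. embeddings: (n, d) float array, labels: (n,) int array."""
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)                 # exclude the query itself
    scores = []
    for i, label in enumerate(labels):
        r = int(np.sum(labels == label)) - 1        # R: same-class references for this query
        if r == 0:
            continue
        ranked = np.argsort(-sims[i])[:r]           # indices of the top-R retrievals
        hits = labels[ranked] == label
        prec_at_i = np.cumsum(hits) / (np.arange(r) + 1)
        scores.append(float(np.sum(prec_at_i * hits)) / r)   # P(i) counted only when correct
    return float(np.mean(scores))
```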

5. Results

In the following, the results for our research questions are detailed. As CodeSearchNet does not include the C/C++ language, there is no L-adapter for this language from the CodeSearchNet dataset. It is worth mentioning that we do not aim to improve the results of the baselines, but to study to what extent adapters can perform.

5.1. RQ1: L-Adapters’ Representations

For this task, neither the L-adapters nor RoBERTa and CodeBERT are fine-tuned; the trained models are simply evaluated for accuracy on CT, which measures the performance of the L-adapters in capturing the representation of the programming languages. The results are presented in Table 3. The rows L-adapterCN and L-adapterCSN represent the Python-adapter or Java-adapter that is trained on CodeNet or CodeSearchNet and inserted in the RoBERTa model. The L-adapters are tested on the programming language that they are trained on. The models are tested in two settings. First, we test them on the datasets as we obtained them. These datasets include pairs of code and a short description of its functionality in natural language. This is what is originally used by the publishers on CodeXGLUE and is shown as ‘W/ NL’ in Table 3. In the second setting, we removed the natural language descriptions and then tested the L-adapters on a cloze test dataset that contains code only. This is shown as ‘W/O NL’ in Table 3. In both settings, the models should predict the masked tokens; in the ‘W/ NL’ setting the masked tokens come from NL and PL, and in the ‘W/O NL’ setting the masked tokens come from the programming language only. There is not much difference between the results of the two settings.

Note that RoBERTa is pre-trained on natural language, and the L-adapterCN and L-adapterCSN models are used to adapt this N-PTLM to the programming languages. The adapters are still able to perform close to CodeBERT, which is fully pre-trained on programming languages (improving CodeBERT's results is not our goal here). On the CT-all task, the Java-adapter that is trained on CodeSearchNet even outperforms CodeBERT.

The cloze test is composed of the validation and test sets of CodeSearchNet. So, the L-adapters that are trained on CSN have better results in CT-all compared to the L-adapters trained on CodeNet. In contrast, for CT-Max/Min, the L-adaptersCN have higher scores. This can be related to the fact that CodeNet has a higher number of training code samples that include the tokens min or max. This number for Python in CodeNet is 47,388, compared to 11,268 for Python in CSN. Similarly, there are 7,640 min and max Java tokens in CodeNet compared to 5,046 in CSN. As the only prediction here is over two tokens, the L-adaptersCN achieve higher scores.

Table 3. Accuracy scores of the models on CT. Best scores are bold and the second high scores are underlined.
Model Python Java
W/ NL W/O NL W/ NL W/O NL
CT-Max/Min
RoBERTa 59.18 59.73 59.75 59.13
L-adapterCN 71.84 72.31 71.78 71.78
L-adapterCSN 66.30 66.54 66.81 66.81
CodeBERTMLM 79.27 77.93 91.08 89.01
CT-All
RoBERTa 54.49 54.56 50.75 51.15
L-adapterCN 66.05 66.39 61.37 61.11
L-adapterCSN 74.35 75.87 75.63 76.45
CodeBERTMLM 83.34 83.23 75.53 74.81

5.2. RQ2: Adapters’ Ability for Cross-Modal Transfer to Downstream Task

The results of MODE-X for code clone detection are shown in Table 4. The programming language of the adapters in MODE-X is shown as superscript in the table, and the dataset that is used for training the L-adapters is shown as subscript. We followed the recommended settings of the baselines for fine-tuning. For CodeBERT, we provide the results for both variants of CodeBERT, pre-trained on MLM only and pre-trained on MLM + RTD.

For T-adapters in natural language, it is reported that the last layer of the model learns the MLM objective (Pfeiffer et al., 2021b), so better results are obtained when L-adapters are dropped from the final layer, leaving only T-adapters in that layer. Therefore, we ran different experiments: i) not dropping the L-adapters, ii) dropping the L-adapters from layer 12, and iii) dropping the L-adapters from layers 11 and 12. We report the best scores obtained. Similar to the NLP adapters, the best results are obtained when the L-adapters are dropped from the last one or two layers. For BCB and SCD-88, the best scores are for the model with the L-adapter dropped from its final layer, which differs by less than one point from the other settings. The best result for POJ-104 is achieved by dropping the L-adapters from the last two layers, improving the all-layers setting by less than 3 points.
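Dropping the L-adapters from the last layer(s) while keeping the T-adapters everywhere can be expressed through the adapter configuration; the sketch below assumes the leave_out option of the adapter-transformers configuration API, whose name and availability may differ across library versions.

```python
# Hedged sketch: an L-adapter configuration that skips the last Transformer layer
# (index 11 in 12-layer RoBERTa), while the T-adapter remains in all layers.
from transformers import AdapterConfig

l_config = AdapterConfig.load("pfeiffer+inv", leave_out=[11])  # drop L-adapter from layer 12
t_config = AdapterConfig.load("pfeiffer")                      # T-adapter in every layer
```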

For all three datasets, the scores of MODE-X lie between those of RoBERTa and CodeBERT, which are fully fine-tuned on code clone detection, even surpassing CodeBERTMLM on the BCB dataset. For the C/C++ and Python datasets, the adapters' scores are 4-5 MAP@R points below CodeBERTMLM+RTD. An interesting observation is that CodeBERT is not pre-trained on C/C++, but on other programming languages, and is only fine-tuned on C/C++ for the clone detection task. The higher score of CodeBERT in this case is related to the knowledge learned from other programming languages. In comparison, RoBERTa has not seen any programming language during pre-training, but adding the C/C++-adapters to its layers helps improve its results for code clone detection. For Java, adding Java-adapters to the RoBERTa model improves RoBERTa's results, which are even better than CodeBERTMLM and very close to CodeBERTMLM+RTD. Note that Java is among the languages that CodeBERT is pre-trained and fully fine-tuned on.

Table 4. Scores of the code clone detection for RoBERTa, CodeBERT, and MODE-X. The best scores are bold and the best scores of MODE-X are underlined.
Model Dataset Score
RoBERTa POJ-104 81.52 (MAP@R)
MODE-XC/C++CN POJ-104 82.40 (MAP@R)
CodeBERTMLM POJ-104 85.08 (MAP@R)
CodeBERTMLM+RTD POJ-104 86.48 (MAP@R)
RoBERTa BCB 95.96 (F1)
MODE-XJavaCN BCB 96.43 (F1)
MODE-XJavaCSN BCB 96.61 (F1)
CodeBERTMLM BCB 96.38 (F1)
CodeBERTMLM+RTD BCB 96.65 (F1)
RoBERTa SCD-88 73.90 (MAP@R)
MODE-XPythonCN SCD-88 75.65 (MAP@R)
MODE-XPythonCSN SCD-88 75.65 (MAP@R)
CodeBERTMLM SCD-88 80.71 (MAP@R)
CodeBERTMLM+RTD SCD-88 78.95 (MAP@R)

CodeBERTMLM+RTD is trained on bimodal data, where the RTD objective is trained exclusively on source code, for all six programming languages of CodeSearchNet. So, the RTD explicitly injects source-code information into CodeBERT's representation space. Although the impact of this dual objective may be unclear for code clone detection, CodeBERTMLM+RTD achieves higher results than CodeBERTMLM for other tasks (Feng et al., 2020). In our study also, CodeBERTMLM+RTD has higher scores than CodeBERTMLM for two of the datasets. In this study, we focus on evaluating the effectiveness of the cross-modal transfer abilities of the adapters. Therefore, we train the adapters solely on the MLM objective, and hence we would ideally compare the adapters with CodeBERTMLM. However, we provide the results for both variants for clarity. The MODE-X results are closer to the ones from CodeBERTMLM, while being 60-140 times more parameter efficient in fine-tuning the parameters.

Another point to add here is that for training the T-adapters, we used the recommended hyperparameters on AdapterHub (Pfeiffer et al., 2020a). However, those hyperparameters are recommended for natural language. Therefore, we ran additional experiments only for code clone detection on the SCD-88 dataset. When a different learning rate is used (5E-4 here), the results of MODE-X improve to 79 MAP@R, which is comparable to CodeBERTMLM+RTD and exceeds the results of CodeBERTMLM.

5.3. RQ3: Computational Efficiency of Adapters

Refer to caption
Figure 4. Parameter budget of Java-adapters and CodeBERTMLM and CodeBERTMLM+RTD in millions.

The efficiency of the adapters is evaluated using i) their parameter budget and ii) their memory usage. The parameter budget is the number of learnable parameters in the model. For adapters, as we do not re-train RoBERTa, the parameter budget is the number of parameters required for training the adapters only. We report the memory and parameter budgets of the adapters for the entire 12 layers of the model, not a single adapter. Note that the numbers are the same regardless of the dataset they were trained on, as the architecture is the same. Figures 1, 4, and 5 show the parameter efficiency of the adapters compared to CodeBERT in millions.

The parameter budget for CodeBERT is 124.65 million, as it re-trains all the parameters of RoBERTa. The parameter budget for CodeBERT used for code clone detection is given by summing the number of parameters tuned for pre-training the model (∼124 million) and the number of parameters for fine-tuning the model along with the task-specific head for code clone detection (∼125 million), adding up to 249.3 million parameters for clone detection on POJ-104 and SCD-88, and 250.48 million for BCB clone detection. This difference in the number of parameters across the code clone detection datasets is related to the different formulation of this task for BCB compared to the other two.

The L-adapters and T-adapters have parameter budgets of 7.39 and 0.89 million, respectively. Therefore, for code clone detection on POJ-104 and SCD-88, the number of parameters required for MODE-X is 8.28 million (= 7.39 + 0.89). For code clone detection on the BCB dataset, MODE-X requires more parameters, a total of 9.46 million. For task-specific fine-tuning, we only consider the parameters that are required for fine-tuning in CodeBERT and the parameters for training the T-adapters in MODE-X (i.e., excluding the pre-training parameters of CodeBERT and the parameters of the L-adapters). For task-specific fine-tuning, adapters are 60.7 (= (250.48-124.65)/(9.46-7.39)) times and 140.05 (= (249.3-124.65)/0.89) times more parameter efficient than CodeBERT on BCB and on POJ-104/SCD-88, respectively. When considering the overall budget, i.e., the number of parameters required for training and fine-tuning CodeBERT versus the number of parameters used for training the L-adapters and T-adapters, adapters are 26.47 to 30.1 times more parameter efficient than CodeBERT for code clone detection, and 16.86 times more parameter efficient than CodeBERT for the cloze test task, as CT does not require fine-tuning. (CodeBERT is pre-trained using RoBERTa as initialization. If we added the cost of training RoBERTa to the adapters' parameter budget, we would have to do the same for CodeBERT. Instead, we consider that both approaches use RoBERTa as initialization and describe the parameter budget as the total number of parameters trained.)
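The efficiency ratios above follow directly from these budgets; the small sketch below reproduces the arithmetic (counting the trainable parameters of an actual model can be done by summing p.numel() over parameters with requires_grad set).

```python
# Reproduce the task-specific fine-tuning ratios reported above (values in millions).
codebert_ft_bcb = 250.48 - 124.65     # parameters fine-tuned by CodeBERT for BCB
codebert_ft_poj = 249.30 - 124.65     # parameters fine-tuned for POJ-104 / SCD-88
modex_task_bcb = 9.46 - 7.39          # T-adapter (+ head) parameters trained by MODE-X for BCB
modex_task_poj = 0.89                 # T-adapter parameters trained for POJ-104 / SCD-88

print(codebert_ft_bcb / modex_task_bcb)   # ~60.8 with these rounded figures (60.7 reported)
print(codebert_ft_poj / modex_task_poj)   # ~140.06 with these rounded figures (140.05 reported)

# Trainable parameters of a PyTorch model (illustrative):
# sum(p.numel() for p in model.parameters() if p.requires_grad)
```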

Refer to caption
Figure 5. Parameter budget of Python-adapters and CodeBERTMLM and CodeBERTMLM+RTD in millions.

The memory usage of the adapters and CodeBERT is shown in Table 5. The “Memory” column represents the additional memory required for a new task. The “% Model” column shows the additional memory usage over the RoBERTa model as a fraction of its memory budget. For example, when CodeBERT is fine-tuned for a new task, the whole model, which is 477.98 MB, should be saved. This is compared to the required memory for MODE-X, 31.63 MB, which sums the memory usage of the L-adapters and T-adapters. For CodeBERT, the whole model should be saved again (100%), compared to about 5.9% (= (28.20/477.98) × 100) of the RoBERTa model for the L-adapters and less than one percent for the T-adapters. As can be seen, MODE-X is over 15 times more memory efficient than CodeBERT since, in contrast to the pre-trained model CodeBERT, during fine-tuning only the adapters need to be loaded in memory.

Table 5. Memory usage of adapters compared to C-PTLM.
Model Memory (MB) % Model
CodeBERT 477.98 100
L-adapters 28.20 5.89
T-adapters 3.43 0.72
MODE-X 31.63 6.62

It is worth mentioning that pre-training CodeBERT needs 384 hours of training on 16 interconnected V100s for a batch size of 64 (Feng et al., 2020). In contrast, the L-adapters need 35 hours of pre-training on a single V100 GPU for the same batch size (mentioned in the supplementary materials). Moreover, fine-tuning CodeBERT required 2.5 hours for CCD, compared to less than an hour for the T-adapters. As adapters are significantly more parameter efficient, they also have faster inference, which emphasizes their usage in practice.

5.4. RQ4: Dropout Experiments

Refer to caption
Figure 6. The accuracies of L-adapters in the dropout experiment. The two right plots show the average accuracy of the L-adapters tested on all languages available for cloze test on CodeXGLUE.

We study the performance of the model with L-adapters when they are added incrementally to each layer of RoBERTa. Due to the architecture of adapters, we are not able to use adapters in each layer separately, as they should receive input from the previous layers. So, we use the incremental setup for this experiment. The accuracy scores of the model with L-adapters are shown in Figure 6. The dropped-layer number on the x-axis shows the layer from which we start dropping the L-adapters. For example, for each layer i, the L-adapters are inserted in all of layers 1 to i and are dropped from layers i+1 onward in the model, which is then tested for the cloze test. The left column shows the plots for the Java-adapter and the middle column those for the Python-adapter. The results of the L-adaptersCN are shown as a solid blue line and those of the L-adaptersCSN as a solid yellow line. The red and green dashed lines are the accuracy of CodeBERT and RoBERTa, respectively. The ND after layer 11 in the plots stands for no-drop, meaning that L-adapters are inserted in all layers. The right column of this figure shows the average results of the L-adapters and PTLMs when tested on all of the programming languages that are available for CT on CodeXGLUE: Java, Python, Ruby, Go, PHP, and JavaScript (more in Section 6).

An interesting observation here is the difference in the behavior of the L-adapters trained on CN and CSN. When the adapters are trained on CodeSearchNet, there is a small increase in the results for CT-Max/Min until layer 10. Even for CT-all, the accuracy of the L-adapters until layer 10 is close to zero. There is an increase in the scores at layer 11, and a significant jump from layer 11 to when they are inserted in all 12 layers of RoBERTa. This plateau seen for CodeSearchNet and the sudden increase in the last layer could be related to the fact that the CT tasks are built from the validation and test sets of the same dataset; however, we could not find an explicit explanation for this behavior, noting that we are confident about the training of the L-adapters (not overfitting) and the generated results. In contrast, there is an increasing trend for the model with L-adaptersCN for both CT-Max/Min and CT-all. The L-adaptersCN for CT-Max/Min exceed the accuracy of RoBERTa after layers 4 and 7 in Java and Python, respectively. A similar trend is seen for both the Java and Python adapters after layers 8 and 7 for CT-all. As CT is considered a probing task, this shows that the deeper adapter modules learn better semantics about the programming language than the ones used in the initial layers, which is confirmed by the increasing trend in these plots.

6. Discussions

Zero-Shot Setting: We ran several experiments to test the ability of the L-adapters in the zero-shot setting, i.e., tested on an unseen language, with the same setup as the dropout experiment. For example, the Java-adapter is tested on the CT task for another programming language, such as Go. We could not apply the zero-shot setting to code clone detection, because the custom heads used on top of the model for code clone detection cannot be transferred to another dataset due to the difference in the evaluation metrics (F1 and MAP@R) and the difference in the R value for the POJ-104 and SCD-88 datasets. Therefore, the experiments are on the cloze test task. We applied each of the L-adapters to the CT-all and CT-Max/Min tasks, for all six programming languages that are available on CodeSearchNet: Python, Java, Ruby, Go, PHP, and JavaScript. We applied it both for W/ and W/O NL tokens (see RQ1) and for the L-adaptersCN and L-adaptersCSN. There is not much difference between the results of W/ NL and W/O NL, so we only present the results with NL here. The two plots in the right column of Figure 6 compare the average scores obtained in this setting by the L-adapters (i.e., averaging the scores when tested on each PL separately) and the average scores of RoBERTa and CodeBERT. Due to space limitations, we cannot provide all the scores for each language in the test set separately here, but we include all tables in the supplementary document. The solid lines are the scores of the L-adapters, and the PTLMs' scores are shown as dashed lines.

Interestingly, the results of L-adaptersCN for CT-all are below the accuracy obtained by RoBERTa. This might be related to the fact that when L-adapters are inserted in RoBERTa, the model is tuned to learn about a specific programming language. Specifically, the CT-all task is more related to the syntactic representation of the programming language, as almost all of the vocabulary used in this task consists of identifiers (we discuss this point in the next section). Hence, as the adapters also learn the syntactic representation of a specific programming language, they perform poorly when tested on other languages with different syntax. The reason lies in the fact that the L-adapters are trained on a single language and aim to amplify the signals specific to that language (the source language, e.g., Java). As the syntactic representations learned in the embedding space are modified by the inversion layers, it becomes impossible to amplify the signals of a specific language (the target language, e.g., Go) without having an impact on the representations learned by the model for the source language. In the zero-shot setup, neither the adapters nor the PTLM have seen any of the data points from the other five languages. Hence, the drop in performance in comparison to RoBERTa is expected.

When the models with L-adapters are applied to the CT-Max/Min task, the L-adapters exceed RoBERTa's scores, meaning that they are able to learn the semantics of programming languages. The difference here is that the model with L-adapters cannot learn the syntax of a language without having seen any instances of it. In contrast, the model can perform well on semantics, having learned the semantics of one language and transferring the common elements to the unseen languages. For L-adaptersCSN, a behavior similar to what we explained in RQ4 is seen. For both CT-all and CT-Max/Min, the model with L-adapters in all 12 layers exceeds RoBERTa's score. CodeBERT performs well in this setting, as the results for CodeBERT are not zero-shot scores.

Exploring CT-all: Based on the difference in the results of the zero-shot setting for CT-all and CT-Max/Min, we investigate the CT-all task further. Originally, CT-all was introduced on CodeXGLUE as a second cloze-style probing experiment and as a generalized extension of CT-Max/Min that includes 930 tokens. The experimental design of CT-all, however, is data-driven and does not involve expert annotation (unlike CT-Max/Min). Therefore, we generated the Abstract Syntax Trees (AST) of each of the code samples in the experiments for all six languages, using the open-source parsing library tree-sitter (https://tree-sitter.github.io/tree-sitter/). We then extracted all the name entities, known as code entities, for the entire CT-all vocabulary using their labels from the ASTs. Our manual analysis of these labels shows that almost all of the words included in the vocabulary are tagged as identifiers, with only a few words labeled as float-identifier or integer-identifier. Identifiers are part of the syntax of code and are used to represent the syntactic structure of code (Drain et al., 2021; Wang et al., 2021). This shows that CT-all evaluates the syntactic representation of code, which confirms our obtained results and supports the discussions of RQ4 and the zero-shot experiment.

GPU Requirements: Although we use V100 GPUs to train the adapters, we also ran successful pilot studies training adapters on Google Colab. The reason adapters can be trained on a smaller GPU is that all the layers of the Transformer are frozen, so no gradients or optimizer states need to be kept for them on the GPU. So, with a fixed GPU, we can use a larger model with adapters, whereas with fine-tuning we would have to settle for a smaller model.

7. Threats to Validity

External Validity relates to the generalization of the results. This study was done on two code-code tasks and three programming languages, so the results might not generalize to other tasks and programming languages. Similar results could be obtained for them, but this requires more studies.

Internal Validity is related to having unanticipated relationships. One of the threats can be related to training the models. The authors who trained the models have experience with adapter modules for natural language and have theoretical and technical knowledge of NLP. We used the CodeXGLUE benchmark, re-ran the experiments of the cloze test and code clone detection, and confirmed the differences between the results we obtained for the N-PTLM and C-PTLM with the authors of CodeBERT. To mitigate obtaining unwanted results, we used the publicly available datasets from this benchmark platform and followed all of the steps mentioned in their pipeline to evaluate the models. We also trained the L-adapters with an additional dataset from IBM. Additionally, we conducted pilot studies to find the best setup for the adapters and baselines. A threat here can be related to the results obtained for code clone detection on the SCD-88 dataset, as we reformulated the score to MAP@R because this is a retrieval task. Although the results for this dataset do not contradict the other ones, we publish this dataset and all the scripts with this submission for replication purposes.

Construction Validity relates to what a test claims to measure and what it actually measures. Through our studies and based on the obtained results, we explored the ability of the models in the zero-shot setting through the cloze test. Though this is used in previous works, it would be valuable to test the capabilities of the models for other tasks in this setting.

8. Related Works

Inspired by Transformers (Vaswani et al., 2017) and PTLMs in NLP (Devlin et al., 2018; Liu et al., 2019; Raffel et al., 2020; Zhang et al., 2020b), several studies in software engineering use Transformer-based PTLMs for source code (Kanade et al., 2020; Feng et al., 2020; Buratti et al., 2020; Tufano et al., 2020; Lachaux et al., 2021; Guo et al., 2021). CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) pioneered pre-training a BERT model (Devlin et al., 2018) for code. Consequently, C-BERT (Buratti et al., 2020) and CodeTrans (Elnaggar et al., 2021), based on T5 (Raffel et al., 2020), were introduced. Roziere et al. (Lachaux et al., 2021) present DOBF, an MLM-based pre-training objective that encourages code comprehension. The authors of CodeBERT (Feng et al., 2020) were the first to incorporate bimodal pre-training for source code, learning from NL-PL pairs using the CodeSearchNet corpora (Husain et al., 2020). Concurrently, Tufano et al. (Tufano et al., 2020) showed that BART, a denoising autoencoder-based Transformer (Lewis et al., 2020), initially pre-trained on large English corpora and subsequently on a large corpus of source code, can be fine-tuned for generating assert statements for unit tests. Drain et al. also use a pre-trained Transformer for generating bug fixes (Drain et al., 2021). CLSEBERT was recently developed and is used for four tasks including code clone detection (Wang et al., 2021). Although many PTLMs are developed to represent source code, they share a common property: they should be fine-tuned separately for each downstream task. This brings an issue when scaling up to many tasks is required, as an entirely new model is needed for every target task. Moreover, in multilingual PTLMs like CodeBERT, the model is bound to learn features that help all of its domain languages while discouraging representations that do not. Thus, it is bound to suffer from the “curse of multilinguality” as one begins to scale up the model to include more languages (Conneau et al., 2020).

NLP researchers have recently explored other avenues of efficient knowledge transfer to eliminate the shortcomings associated with the fine-tuning of large-scale PTLMs. The compact and extensible bottleneck layers known as adapters are one of the main techniques (Houlsby et al., 2019). In terms of parameters, adapters are a small fraction of the original Transformer's size, and the Transformer's parameters are frozen while training adapters. This makes adapters scalable. A number of adapter-based frameworks, ranging from language-focused (Artetxe et al., 2020; Pfeiffer et al., 2020b) to task-focused (Bapna and Firat, 2019; Pfeiffer et al., 2021a) approaches, have been proposed. Bapna and Firat (Bapna and Firat, 2019) demonstrate the use of adapters in domain adaptation for Neural Machine Translation and employ them in a multilingual setting. Artetxe et al. (Artetxe et al., 2020) transfer a monolingual PTLM to an unseen natural language via adapters. Subsequent studies reveal the advantages of using multiple distinct task-specific adapters to disentangle different elements of knowledge relevant to the target domain of the downstream task (Pfeiffer et al., 2021a) and of stacking task- and language-specific adapters for effectively adapting a multilingual model to a new unseen natural language (Pfeiffer et al., 2020b).

Although there are many studies on PTLMs for source code and on exploring adapters for NLP, there has been no attempt to extend adapters to other modalities, nor is there work that utilizes adapters for programming languages and software engineering tasks, which we explore in this paper.

9. Conclusion and Future Works

In this paper, we studied transferring the learned knowledge of pre-trained neural language models in a bimodal setting, from natural language to programming language, and assessed the ability and the efficiency of the models through code-code tasks. Adapters improve the results of RoBERTa and can perform close to CodeBERT in some cases. Training and fine-tuning adapters require a significantly lower number of parameters, with less memory storage and faster inference time. Thus, adapters can be used in software engineering to scale models up for many tasks and languages, making them beneficial in practice. Adapters are plug-and-play modules and can be replaced with one another, especially for low-resource languages. We plan to study this characteristic and train multilingual adapters for source code next.

Replication Package. The replication package including SCD-88 dataset, scripts and models is available at https://github.com/fardfh-lab/NL-Code-Adapter.

Acknowledgements.
This research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019-05175) and a Mitacs Globalink award, 2021.

References

  • Allamanis et al. (2018) Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–37.
  • Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4623–4637.
  • Bapna and Firat (2019) Ankur Bapna and Orhan Firat. 2019. Simple, Scalable Adaptation for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 1538–1548.
  • Buratti et al. (2020) Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, and Giacomo Domeniconi. 2020. Exploring Software Naturalness through Neural Language Models. arXiv preprint arXiv:2006.12641 (2020).
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. International Conference on Learning Representations, ICLR (2020).
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8440–8451.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Drain et al. (2021) Dawn Drain, Chen Wu, Alexey Svyatkovskiy, and Neel Sundaresan. 2021. Generating Bug-Fixes Using Pretrained Transformers. In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming. 1–8.
  • Elnaggar et al. (2021) Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv preprint arXiv:2104.02443 (2021).
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1536–1547.
  • Guo et al. (2021) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference on Learning Representations (ICLR).
  • Hindle et al. (2016) Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning. PMLR, 2790–2799.
  • Husain et al. (2020) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2020. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv preprint arXiv:1909.09436 (2020).
  • Kanade et al. (2020) Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning. PMLR, 5110–5121.
  • Lachaux et al. (2021) Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages. Advances in Neural Information Processing Systems 34 (2021).
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations. https://openreview.net/forum?id=H1eA7AEtvS
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv preprint arXiv:2102.04664 (2021).
  • Musgrave et al. (2020) Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. 2020. A metric learning reality check. In European Conference on Computer Vision. Springer, 681–699.
  • Perez and Chiba (2019) Daniel Perez and Shigeru Chiba. 2019. Cross-language clone detection by learning over abstract syntax trees. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 518–528.
  • Pfeiffer et al. (2021a) Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2021a. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 487–503.
  • Pfeiffer et al. (2020a) Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A Framework for Adapting Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 46–54.
  • Pfeiffer et al. (2020b) Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7654–7673.
  • Pfeiffer et al. (2021b) Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2021b. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 10186–10203.
  • Philip et al. (2020) Jerin Philip, Alexandre Berard, Matthias Gallé, and Laurent Besacier. 2020. Language adapters for zero shot neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4465–4470.
  • Puri et al. (2021) Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. 2021. Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. arXiv preprint arXiv:2105.12655 (2021).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
  • Tufano et al. (2020) Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers. arXiv preprint arXiv:2009.05634 (2020).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang et al. (2021) Xin Wang, Yasheng Wang, Pingyi Zhou, Meng Xiao, Yadao Wang, Li Li, Xiao Liu, Hao Wu, Jin Liu, and Xin Jiang. 2021. CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model. arXiv preprint arXiv:2108.04556 (2021).
  • Zhang et al. (2020b) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020b. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning. PMLR, 11328–11339.
  • Zhang et al. (2020a) Ting Zhang, Bowen Xu, Ferdian Thung, Stefanus Agus Haryono, David Lo, and Lingxiao Jiang. 2020a. Sentiment analysis for software engineering: How far can pre-trained transformer models go?. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 70–80.
  • Zhu et al. (2021) Yaoming Zhu, Jiangtao Feng, Chengqi Zhao, Mingxuan Wang, and Lei Li. 2021. Counter-Interference Adapter for Multilingual Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2021. 2812–2823.