
Learning code summarization from a small and local dataset

Toufique Ahmed University of California, DavisDavisCaliforniaUSA95616 [email protected]  and  Premkumar Devanbu University of California, DavisDavisCaliforniaUSA95616 [email protected]
Abstract.

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting), and a maximalist hybrid approach: fine-tuning first on many projects in many languages and then training on the same project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state of the art, on many different projects in both Java and Python.

deep learning, same-project training, code summarization, transfer learning

1. Introduction

Machine learning applications in software engineering have been very successful in practice (e.g., Microsoft’s Copilot) and also on a wide range of more advanced applications (Lu et al., 2021). Recently, there has been a great deal of interest in foundation models (Feng et al., 2020; Wang et al., 2021; Guo et al., 2020; Kanade et al., 2020; Ahmad et al., 2021), which subject a very highly parametrized, high-capacity neural model to a two-phase training regime. The first, unsupervised “pre-training” phase is done with an enormous corpus, using a simple fill-in-the-blanks or predict-the-next-token/sentence regime. This phase can be carried out on essentially any (unlabeled) code data harvested from the web. There is no task-specific goal here; the model simply learns the statistics of the input data. The second phase, fine-tuning, trains on-task, and requires carefully curated, consistently labeled data, typically consisting of input-output pairs reflecting good, consistent, on-task performance.

The challenges of creating well-curated, de-duplicated, and yet relevant training datasets have been described by several authors (Gros et al., 2020; Allamanis, 2019). Recent papers by Ahmed and Devanbu (Ahmed and Devanbu, 2022) and Chen et al. (Chen et al., 2022) explore some of the issues that arise in finding sufficient quantities of high-quality fine-tuning data. Data availability may be limited for several reasons. First, some languages (e.g., Ruby) are relatively less popular than others, and available high-quality data may be limited. Second, the projects in a language may be skewed towards one application domain (e.g., JavaScript for the web), and thus the performance of the trained model may be somewhat uneven. Finally, and most interestingly, when curating software engineering datasets for on-task fine-tuning, yet another unique wrinkle arises: project specificity.

It’s well known that developers in different projects behave somewhat differently; they use different terminology, different algorithms, and even different coding practices. As far back as 2009 (Zimmermann et al., 2009), it was observed that cross-project models don’t perform as well as in-project models on defect prediction tasks. These difficulties carry over into language models; even the earliest paper on language modeling for code (Hindle et al., 2012) noted application-specific effects. Subsequent work by Tu et al. (Tu et al., 2014) and Hellendoorn & Devanbu (Hellendoorn and Devanbu, 2017) noted the highly local, project- and even file-specific vocabularies of source code, and proposed ways to handle them.

This phenomenon offers an entirely new opportunity: Can project-specific training data improve performance? On the plus side, since vocabulary, coding styles, etc. are notoriously project-specific, training and testing on the same project should give better performance. This seems like easy, low-hanging fruit. However, there are a couple of traps. First, when working within a project, one has to be careful in partitioning training and test data, so that we only use data that would realistically be available in practice. Second, within-project data may be substantially more limited than cross-project data. For this reason, within-project training regimes require models that learn well from fewer samples.

For this reason, we believe that it would be useful to investigate approaches that improve sample efficiency for the fine-tuning phase of foundation model training. By “improving sample efficiency” we mean the general goal of increasing the ability of machine-learning models to learn to perform better with fewer training samples. For example, a model A that reliably performs as well as model B with far fewer training samples is a more “sample efficient” model. A sample-efficient model A both requires less data and potentially trains much faster, thus saving human effort, time, and energy. Most of all, in settings where high-quality training data is not abundant, model A would be more attractive.

Finally, as a sort of “stress-testing” of the same-project tuning idea, we applied this to an extremely well-tuned code summarization model, to see if same-project training provided any improvement at all. For this we used the multilingual “PolyGlot” model, by Ahmed & Devanbu (Ahmed and Devanbu, 2022). They found that cross-project, multi-lingual training, using a very large fine-tuning set in many languages, provided best-in-class performance (this “PolyGlot” model was the chart-topper on the CodeXGLUE Leaderboard, https://microsoft.github.io/CodeXGLUE/, for a while, although CodeT5 has since reported better performance). We wondered whether even this extensively well-tuned model could be further improved on a specific project by further fine-tuning on the same project. One might expect that it wouldn’t, since it is already so well trained…but actually, it worked!

In this paper we consider same-project fine-tuning, for the task of code summarization, and make the following contributions

  1. (1)

We investigate the benefits of within-project training, using a time-series scenario: we train only on “past” data, and evaluate on “future” data. In the code summarization setting, this reflects a realistic situation where a developer asks for the summary of a piece of code, and we train only on data already available at that specific point in the history of the project. We find that within-project training offers some advantages.

  2. (2)

We adapt the GraphCodeBERT foundation model specifically to improve its sample efficiency for code summarization; the resulting GCBhybrid model achieves high levels of sample-efficiency and can outperform the state of the art in some project-specific settings.

  3. (3)

    We also found that the “maximalist stress test”, adding project-specific fine-tuning to the already extensively fine-tuned “PolyGlot” model, actually provides further benefits, and yields the best performance, comfortably beating the state of the art CodeT5 model overall, with statistical and practical significance (pairwise Wilcoxon test, with a difference in means, over all test samples, of about 3.7 BLEU-4; this is above the 2.0 BLEU-4 threshold difference that humans are experimentally (Roy et al., 2021) known to notice).

  4. (4)

Finally, while the “PolyGlot”+same-project setting is most performant, we find that same-project training is remarkably efficient; even the largest projects use less than 2.5% of the time taken for cross-project training, while attaining performance comparable to the current state of the art.

The paper begins in the next section with some motivating explorations of project-specific phenomena relevant to the foundation model setting. Following that we present our methodology, followed by results, discussion, and related work. The paper ends with a brief speculation on future directions.

2. Background & Motivation

We begin with a brief overview of foundation models (Bommasani et al., 2021), which are currently widely used in NLP and SE. Foundation models are trained in two stages (i.e., pre-training and fine-tuning). In the pre-training stage, we train the models with billions of unsupervised tokens to teach the models the statistics of the language in a self-supervised way, using simple tasks like auto-regressively predicting the next token, filling in a blank in context, completing the next sentence, denoising an input, etc. These tasks are performed in a multi-layer deep network; the intermediate layers thus learn a representation (“embedding”) of the salient patterns of input token sequences in the code. Later we take the embeddings of the input sequence learned in the pre-training stage and further train the model with supervised data in the fine-tuning stage. This pre-training+fine-tuning paradigm was first introduced in NLP by Devlin et al. (Devlin et al., 2018). They proposed an encoder-only model, BERT, pre-trained with two training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM is the most effective pre-training objective for encoder-only models: a certain percentage of tokens is randomly masked out, and the model learns to recover them. Liu et al. showed that RoBERTa outperforms BERT using only MLM as a pre-training objective, with some new training strategies (e.g., dynamic masking instead of static masking of a sequence) and hyperparameter tuning (Liu et al., 2019). BERT-style encoder-only models have inherent limitations for seq2seq generative tasks like Neural Machine Translation (NMT) because of the missing trained decoder. Two models, BART (Lewis et al., 2019) and T5 (Raffel et al., 2019), have well-trained decoders and perform well on seq2seq generative tasks.

These models are not designed for code; subsequent research has refined these models (retaining the same basic scheme) for code and code-related natural language description. CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2020) are similar to the BERT model, pre-trained with MLM and some code-specific pre-training objectives. PLBART (Ahmad et al., 2021) and CodeT5 (Wang et al., 2021) are adaptations of BART (Lewis et al., 2019) and T5 (Raffel et al., 2019) specially designed for SE tasks. Code-specific pre-trained models perform quite well on several SE tasks, including code summarization. The standard benchmark dataset CodeXGLUE (Lu et al., 2021) is used to evaluate these models. CodeXGLUE includes a de-duplicated code summarization dataset prepared by modifying the CodeSearchNet (Husain et al., 2019) dataset. CodeXGLUE is a multilingual dataset (consisting of data from six languages) and has between 25K and 252K cross-project training samples for each language. Though some languages have relatively few samples (e.g., Ruby and JavaScript), others have very large training datasets (e.g., Java and Python).

Ahmed and Devanbu (Ahmed and Devanbu, 2022) have recently shown that multilingual training is beneficial for code summarization. Identifiers play a significant role in ML-based summarization, and they are mostly preserved across languages; this phenomenon enables cross-language training to work well. This prior work suggests that if methods from the same project share similar identifiers, then same-project training can benefit the model. However, there are some issues with using same-project data. First, to be realistic, we can only use data as it becomes available; thus at any point in time, only past data in the same project is available; we cannot use data on classes, methods, etc. that haven’t been created yet. Thus we perform all our evaluations below in a “time-series” or “time-partitioned” setting. Second, and following from this time-partitioned train-and-test approach, sample sizes are limited. Sometimes there are no more than a few hundred samples for a project, which differs greatly from the cross-project, cross-language setting where hundreds of thousands of instances can be used to fine-tune the models. If the pre-trained models are sample-efficient, then same-project training could prove effective for the code summarization task. We briefly look into two preliminary, motivating questions (PMQs):

PMQ 1:

Do the different samples in the same project share more identifiers than samples drawn from random projects? If this were the case, one might hope that training with same project data would be especially advantageous.

PMQ 2:

Are the high-capacity pre-trained models sample-efficient? If these models were not especially sample-efficient, then (because same-project data is not as abundant as cross-project data) we might have difficulty exploiting any same-project data synergies.

PMQ 1: Are identifiers preserved across same-project samples? We conjecture that same-project samples have higher identifier similarity because of domain and API-usage similarity. Same-project samples will also use overlapping sets of user-defined class objects and identifiers. To evaluate this question, we perform a small experiment using time-partitioned data. We take five projects from the Java CodeXGLUE code summarization dataset, each with at least 200 samples, and sort the samples according to their creation date. We perform the following steps.

  1. (1)

    Divide the first 200 samples of each project into two groups (i.e., 1-100 and 101-200).

  2. (2)

    Take each group and find the unique case-insensitive identifiers of the group. We repeat it for all five projects.

  3. (3)

    Take group I of project A and calculate the Jaccard index with group II of the same project.

  4. (4)

Now pair group I of project A with group II of all other projects and calculate the Jaccard index (computed as $\frac{|X\cap Y|}{|X\cup Y|}$).

  5. (5)

    Repeat steps 3 and 4 for all projects and observe the Jaccard Index difference.

Table 2 shows that we get the highest Jaccard indices (always 2–5 times higher than the off-diagonal entries) on the diagonal, where both groups come from the same project. Hence, same-project samples have higher identifier similarity.
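To make the procedure concrete, the following minimal sketch computes the full matrix of Jaccard indices reported in Table 2; the crude regex-based identifier extractor is our own illustrative assumption, standing in for a proper language-aware lexer.

import itertools
import re

def extract_identifiers(code):
    # Crude, case-insensitive identifier extraction; a real pipeline would
    # use a language-aware tokenizer (this regex is an assumption).
    return {tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)}

def group_identifiers(methods):
    # Union of unique identifiers over one group of time-sorted methods.
    ids = set()
    for m in methods:
        ids |= extract_identifiers(m)
    return ids

def jaccard(x, y):
    return len(x & y) / len(x | y) if (x | y) else 0.0

def overlap_matrix(groups):
    # groups[p] = (group_I_methods, group_II_methods) for project p,
    # i.e., samples 1-100 and 101-200 in creation-date order.
    names = list(groups)
    g1 = {p: group_identifiers(groups[p][0]) for p in names}
    g2 = {p: group_identifiers(groups[p][1]) for p in names}
    return {(a, b): jaccard(g1[a], g2[b])
            for a, b in itertools.product(names, names)}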

Observation 1. Same-project samples are likely to exhibit more identifier similarity than cross-project samples.

PMQ 2: What is the fine-tuning sample efficiency of foundation models? To observe the sample efficiency of foundation models, we consider the best-performing model from each of two families (i.e., GraphCodeBERT from the BERT-style encoder-only models and CodeT5 from the seq2seq generative models). We also introduce a hybrid model, GCBhybrid, in this paper (described in more detail below), where we cascade the GraphCodeBERT encoder with a pre-trained decoder. For this experiment, we use the Java CodeXGLUE code summarization dataset. Note that CodeT5 is the best-performing model for this task, achieving 20.32 BLEU-4, while GraphCodeBERT reaches 19.22 BLEU-4. We sample datasets of different sizes (10-300 examples) and observe the cross-project performance of the three models. Table 1 shows that with 300 cross-project samples, CodeT5 achieves 18.23 BLEU-4, which is only about 2 BLEU-4 lower than its performance with the complete dataset of ~165k samples, i.e., with about 550 times as much data! Therefore, we can conclude that models like CodeT5 are fairly sample-efficient and can perform well with a few data samples. However, CodeT5 struggles to summarize code well when fewer than 150 training samples are available.

In the same-project fine-tuning scenario, this situation is quite common in many projects, as we argue later. Happily, our GCBhybrid model is highly sample-efficient, and attains two-digit BLEU-4 even with ten examples. This is because GCBhybrid’s pre-trained decoder is specifically trained to generate (denoised) comments. GCBhybrid dominates CodeT5 until about 150 samples become available.

# of samples GraphCodeBERT GCBhybrid CodeT5
10 4.88 11.37 1.38
50 9.29 13.7 1.84
100 10.02 14.73 2.32
150 10.33 14.98 14.93
200 10.57 15.51 18.64
250 10.73 15.63 18.81
300 10.58 15.71 18.23
Complete (~165k) 19.22 19.97 20.32
Table 1. Fine-tuning Sample-efficiency of Foundation models
Observation 2. Pre-trained models can be adapted to be fine-tuning sample-efficient; such models are competitive with the state of the art for the code summarization task when samples are limited.

The following sections discuss same-project training for code summarization using time-series data, and examine whether it can outperform cross-project training with only a few examples.

Group I (rows) vs. Group II (columns): oblac/jodd, wildfly/wildfly, orientechnologies/orientdb, Unidata/thredds, ngageoint/geopackage-android
oblac/jodd 0.16 0.08 0.08 0.06 0.05
wildfly/wildfly 0.06 0.16 0.06 0.05 0.03
orientechnologies/orientdb 0.07 0.07 0.17 0.05 0.05
Unidata/thredds 0.07 0.06 0.06 0.10 0.04
ngageoint/geopackage-android 0.05 0.04 0.05 0.05 0.19
Table 2. Intra and inter project identifier overlap

3. Methodology

This section briefly describes the dataset preparation and foundation models we used for the evaluation.

3.1. Dataset Preparation

To evaluate the potential of same-project, sample-efficient training, we prepare a new dataset from the CodeXGLUE benchmark dataset. There are three reasons for choosing CodeXGLUE:

  1. (1)

    The dataset is known to be appropriately de-duplicated, thus avoiding issues raised in prior work (Allamanis, 2019; Shia et al., 2022)

  2. (2)

    We can more easily baseline our approach; most foundation models have been evaluated on this dataset.

  3. (3)

    This dataset provides the complete path to all the functions with commit ID and line number. Using this information, we can find out the creation date of that particular function, and perform time-series partitioning.

Preparing a same-project dataset, and then partitioning it for training and test, has to be done carefully, to avoid the risk of data leakage from future to past instances during evaluation. Therefore, we perform time-series partitioning: we sort the samples from each project according to creation date and perform an 80:20 split. 80% of the data is used for training, and the later 20% is randomly divided into test and validation sets. Note that we could not get exact 80:20 splits for all projects because several functions share the same creation date at the split point. We now briefly discuss the process for assigning creation dates to the examples and the number of projects used for the evaluation.
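A minimal sketch of this time-partitioned split is shown below. It assumes each sample already carries the creation timestamp assigned by the procedure described next, and it enforces the conservative rule (discussed below) that samples sharing the boundary timestamp stay on the training side, which is why the split can deviate slightly from 80:20.

import random
from collections import namedtuple

Sample = namedtuple("Sample", ["code", "summary", "created"])  # created: datetime

def time_series_split(samples, train_frac=0.8, seed=0):
    # Sort by creation date so the model is only trained on "past" data.
    ordered = sorted(samples, key=lambda s: s.created)
    cut = int(len(ordered) * train_frac)
    # Push the boundary forward while the timestamp is shared across it,
    # so no timestamp appears in both training and validation/test.
    while 0 < cut < len(ordered) and ordered[cut].created == ordered[cut - 1].created:
        cut += 1
    train, rest = ordered[:cut], ordered[cut:]
    # The later 20% is randomly divided into validation and test halves.
    random.Random(seed).shuffle(rest)
    half = len(rest) // 2
    return train, rest[:half], rest[half:]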

Assigning creation dates to functions As already mentioned, CodeXGLUE provides the commit ID and path to the original function with specific line numbers. We use "git blame --ignore-rev" to extract the first commit date of a specific line. Note that most of the functions are multi-line, and we use the start and end line numbers in the command to get the creation dates for all the lines. We consider the earliest date as the creation date of the complete function. We follow this strategy because creation dates differ by line. Consider the following code snippet.

public static BeanCopy from(final Object source) {
    BeanCopy beanCopy = new BeanCopy(source);
    beanCopy.isSourceMap = source instanceof Map;
    return beanCopy;
}

Figure 1. Example for assigning creation date

For the example presented above (Fig. 1), we get the same time-stamp (2015-08-26 12:57:28) for all the lines except the first one (2018-01-13 01:41:10). The approach we follow to extract the creation date has some limitations: "git blame --ignore-rev" reports the most recent commit that changed a specific line. The first line was rewritten/edited from "public static BeanCopy fromMap(Map source) {" to "public static BeanCopy from(final Object source) {" on 2018-01-13 01:41:10, almost 2.5 years later than the original function creation date (2015-08-26 12:57:28). We are interested in the original creation date rather than the last commit that changed a single program line. Note that the change made to the first line does not introduce any major change into the program; it does not even add or remove any identifier. Considering the time-stamps of all the function lines helps us approximate the original creation date: we take the earliest commit time-stamp (2015-08-26 12:57:28) as the function creation date for the example presented in Fig. 1.

There is one case where this approach will fail: if all the program lines have been modified over time at least once, we will fail to recover the actual creation date of the program, because "git blame --ignore-rev" will not be able to report the original creation date for any line. However, this is very unlikely to happen. Another challenge is that we can only track the history recorded on GitHub; our approach will fail if programs are created and edited on a local machine and only later pushed to GitHub. Nevertheless, we believe our approach still yields a fair number of correctly time-sorted instances and can be used to evaluate same-project data. Note that we follow a very conservative rule while splitting each project: multiple functions with the same creation time-stamp are never split across the training set and the validation/test sets.
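The sketch below illustrates this creation-date extraction using git blame's line-porcelain output. It is a simplified version under our own assumptions (for instance, it passes no --ignore-rev arguments, since the exact revisions the authors ignored are not listed); the earliest per-line timestamp is taken as the function's creation date.

import subprocess
from datetime import datetime, timezone

def line_commit_dates(repo, commit, path, start, end):
    # Blame the function's line range at the given commit and collect the
    # author timestamp recorded for each line.
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain",
         "-L", f"{start},{end}", commit, "--", path],
        capture_output=True, text=True, check=True).stdout
    return [datetime.fromtimestamp(int(line.split()[1]), tz=timezone.utc)
            for line in out.splitlines() if line.startswith("author-time ")]

def creation_date(repo, commit, path, start, end):
    # The earliest timestamp over all lines is treated as the creation date.
    return min(line_commit_dates(repo, commit, path, start, end))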

Prepared datasets for different instance ranges While creating the same-project dataset, we could only use the test data from the CodeXGLUE dataset. The pre-trained models (i.e., GraphCodeBERT, GCBhybrid, and CodeT5) are trained using the CodeSearchNet dataset, and there is a possibility of data leakage if we use the training set. We avoided the validation set because the pre-trained models were evaluated on it, and those models are likely to do well if the projects are taken from the validation set. We consider two popular programming languages (i.e., Python and Java) for dataset generation.

Based on our sample-efficiency experiments, we divided our projects into three training-instance ranges.

  1. (1)

Category I: Projects with more than 150 training samples (CodeT5 outperforms GCBhybrid).

  2. (2)

    Category II: Projects with 100-150 training samples (GCBhybrid performs well in that range).

  3. (3)

Category III: Projects with fewer than 100 training samples (none of the models shows impressive performance).

Table 3 presents the number of projects from each category. We selected 34 projects from 2 programming languages.

Category Java Python
Category I 10 7
Category II 6 7
Category III 2 2
Total 18 16
Table 3. Number of projects from each category

3.2. Foundation Models

In this section, we briefly describe the foundation models that we use to compare the performance of cross-project and same-project training.

GraphCodeBERT CodeBERT is one of the first BERT-style encoder-only pre-trained models specially designed for Software Engineering tasks. CodeBERT is pre-trained with two objectives: i) MLM and ii) Replaced Token Detection (RTD). Though CodeBERT is successful in many downstream tasks, it does not exploit any code-specific structural properties. Guo et al. (Guo et al., 2020) describe an encoder-only foundation model, GraphCodeBERT, which adds two pre-training objectives (edge prediction and node alignment) to MLM. The first additional task predicts code structure edges, and the second aligns representations between source code and code structure. These two objectives incorporate data flow into the pre-training stage, a semantic-level code structure that encodes the “where-the-value-comes-from” relation between variables. We use GraphCodeBERT in this paper to evaluate the effectiveness of same-project training because GraphCodeBERT outperforms CodeBERT on all downstream tasks, including code summarization.

CodeT5 CodeT5 (Wang et al., 2021) is a unified pre-trained encoder-decoder Transformer model well-suited for seq2seq generative tasks. This model is pre-trained with three objectives: i) Masked Span Prediction (MSP), ii) Identifier Tagging (IT), and iii) Masked Identifier Prediction (MIP). CodeT5 learns improved embeddings by leveraging the code semantics conveyed by developer-assigned identifiers. It also achieves state-of-the-art performance on the CodeXGLUE code summarization task. Both GraphCodeBERT and CodeT5 are primarily pre-trained with the CodeSearchNet dataset; CodeT5 is additionally pre-trained with some C and C# data.

GCBhybrid

Refer to caption
Figure 2. Steps for preparing GCBhybrid

In the sample-efficiency experiment presented earlier (Table 1), GraphCodeBERT reached only 10.58 BLEU-4 even after fine-tuning with 300 samples. Even the current SOTA, CodeT5, underperforms until we fine-tune with at least 200 samples. However, both models do relatively well on the complete Java dataset, reaching 19.22 and 20.32 BLEU-4, respectively. Why do the pre-trained models underperform with smaller fine-tuning datasets, even after training with billions of unsupervised tokens? GraphCodeBERT does not have a pre-trained decoder; it learns to generate comments only during fine-tuning. Thus, it cannot produce good comments until it has seen a large number of samples. On the other hand, CodeT5’s pre-training also trains the decoder; however, it is trained to “refill” masked-out spans of code, rather than to generate a complete natural language description. The PLBART model is trained to denoise code and natural language descriptions, but it fails to outperform GraphCodeBERT on Java code summarization (18.45 BLEU-4), even with its trained decoder. We propose a hybrid model, GCBhybrid, where we cascade the pre-trained GraphCodeBERT model with a specialized decoder pre-trained to denoise natural language descriptions. Such a decoder helps the model do well on the code summarization task by incorporating prior knowledge about generating natural language descriptions.

Like GraphCodeBERT and CodeT5, we use the CodeSearchNet dataset for training the decoder. We use only the given training partition of CodeSearchNet to prevent any data leakage in the fine-tuning stage, because our final test dataset is taken from the test partition of CodeSearchNet. This yields approximately 2M natural language descriptions. Following BART, we implement five noising modes. Note that we apply two different noise modes to each sample to augment the dataset. We briefly explain each noise mode below.

Comment permutation With this noise mode, we take one code-related natural language description/comment at a time and shuffle the tokens in random order.

Comment rotation We randomly choose one token at a time and rotate the comment to bring that token to the first position. Repairing such noise helps the model learn how comments begin.

Token deletion We randomly choose 15% of the tokens and drop them from the comment. The model’s task is to recover those dropped tokens and generate natural comments.

Token masking Like token deletion, we randomly mask out 15% of the tokens and ask the model to recover them and generate the comment using the decoder. Token masking is comparatively easier than token deletion: in token deletion, the model needs to learn both the position and the content of the missing token.

Token infilling We select a random span with span lengths drawn from a Poisson distribution (λ = 3). We replace the span with a single token <mask>. The model must recover the complete missing span, and thereby learns to predict the number of missing tokens in the span.
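A minimal sketch of the five noise modes, operating on a tokenized comment, is given below; details such as how two noise modes are paired per sample, or how span boundaries are clipped, are our own illustrative assumptions.

import random
import numpy as np

def comment_permutation(tokens):
    # Shuffle the comment tokens into a random order.
    out = list(tokens)
    random.shuffle(out)
    return out

def comment_rotation(tokens):
    # Rotate the comment so a randomly chosen token comes first.
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]

def token_deletion(tokens, rate=0.15):
    # Drop ~15% of tokens; the model must recover both position and content.
    return [t for t in tokens if random.random() > rate]

def token_masking(tokens, rate=0.15, mask="<mask>"):
    # Replace ~15% of tokens with <mask>; only the content must be recovered.
    return [mask if random.random() < rate else t for t in tokens]

def token_infilling(tokens, lam=3, mask="<mask>"):
    # Replace one random span (length ~ Poisson(3)) with a single <mask>.
    span = min(np.random.poisson(lam), len(tokens))
    start = random.randrange(len(tokens) - span + 1)
    return tokens[:start] + [mask] + tokens[start + span:]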

Sequence Type Sequence
Original Return next line with tag masked with whitespace .
Comment permutation with line . masked whitespace next tag with Return
Comment rotation masked with whitespace . Return next line with tag
Token deletion Return line tag masked with whitespace .
Token masking Return next <mask> with <mask> masked with whitespace .
Token infilling Return <mask> tag masked with whitespace .
Table 4. Denoising natural language description

Table 4 illustrates a comment mutated with the 5 noise modes. For training the decoder, we cascade a RoBERTa encoder with a newly created, 12-layer transformer decoder. To make this hybrid model work, we need to ensure both encoder and decoder use the same vocabulary; therefore, we use the original GraphCodeBERT vocabulary for the denoising task. To accelerate the training process, GraphCodeBERT was initialized with CodeBERT, and CodeBERT was loaded with the weights of the natural language RoBERTa. We also initialized our encoder with GraphCodeBERT and continued the denoising task for three epochs. Figure 2 depicts the steps involved in preparing GCBhybrid. We start with (a) the pre-trained GraphCodeBERT model and (b) an untrained 12-layer transformer decoder. These are adjoined (c) and trained together (d) for the denoising task described above. After sufficient denoising performance is achieved, the decoder has become quite good at generating comments. We then detach just the decoder (e) and adjoin it with the pre-trained original GraphCodeBERT encoder to create (f) the GCBhybrid model. We drop the denoising-tuned encoder because it is subject to “catastrophic forgetting” of the knowledge learned in the base model (Kirkpatrick et al., 2017).
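The assembly of GCBhybrid can be sketched as follows using the HuggingFace transformers library. This is only an outline under our own assumptions (checkpoint name, use of EncoderDecoderModel), not the authors' actual implementation, and the denoising training loop itself is omitted.

from transformers import (AutoModel, AutoTokenizer, EncoderDecoderModel,
                          RobertaConfig, RobertaForCausalLM)

checkpoint = "microsoft/graphcodebert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # shared vocabulary

# (a) pre-trained GraphCodeBERT encoder, (b) fresh 12-layer decoder with
# cross-attention, sharing GraphCodeBERT's vocabulary.
encoder = AutoModel.from_pretrained(checkpoint)
dec_cfg = RobertaConfig.from_pretrained(checkpoint, is_decoder=True,
                                        add_cross_attention=True,
                                        num_hidden_layers=12)
decoder = RobertaForCausalLM(dec_cfg)

# (c)/(d) join them and train on the comment-denoising task (loop omitted).
denoiser = EncoderDecoderModel(encoder=encoder, decoder=decoder)

# (e)/(f) keep only the denoising-trained decoder; attach it to a freshly
# loaded GraphCodeBERT encoder, discarding the denoising-tuned encoder to
# avoid catastrophic forgetting. This GCBhybrid model is then fine-tuned
# on (code, summary) pairs.
gcb_hybrid = EncoderDecoderModel(
    encoder=AutoModel.from_pretrained(checkpoint),
    decoder=denoiser.decoder)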

Stress test: the “PolyGlot” Model Finally, as a stress test of the same-project training approach, we wondered whether even a very extensively fine-tuned model such as Polyglot GraphCodeBERT (Ahmed and Devanbu, 2022), which is fine-tuned on an enormous dataset (more than 900K samples) incorporating a diverse set of projects in many languages, could actually benefit from fine-tuning on a relatively small number of same-project samples. For this, we took the published Polyglot GraphCodeBERT model, ran a few epochs of fine-tuning on same-project data, and evaluated it on same-project test data. We could do this without fear of training-test data overlap, because the data partitions provided by CodeXGLUE for pre-training and fine-tuning guarantee that this “PolyGlot” model had not seen these projects during its previous pre-training and fine-tuning.

Complete pipeline and baselines We are primarily comparing the performance of cross-project models, which are fine-tuned with hundreds of thousands of instances, against same-project models fine-tuned with only a few hundred samples. Figure 3 presents the complete pipeline of our approach. In the pre-processing stage, we separate the same-project data using a “segmenter” and convert it to time-series data, following the approach described in Section 3.1, in the “creation date retriever” stage. After preparing the data, we fine-tune four foundation models (i.e., GraphCodeBERT, GCBhybrid, “PolyGlot”, and CodeT5) for the code summarization task and compare them with the performance achieved by those models in the cross-project setup (where there is no shortage of data).

Refer to caption
Figure 3. Complete pipeline for data generation and model training.

4. Results

In this section, we evaluate same-project training for the code summarization task, in different settings.

Fine-tuning cross-project baselines & same-project models As mentioned earlier (Section 3), we compare our approach with models fine-tuned with abundant cross-project data. We need to fine-tune the baseline models ourselves because we require the BLEU-4 of a subset of the test data when evaluating same-project training; the repositories of the baseline models (e.g., GraphCodeBERT, https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text, and CodeT5, https://github.com/salesforce/CodeT5) only provide cumulative (corpus) BLEU-4, which we cannot map onto results for the (test) subset within the same project. For GraphCodeBERT, we fine-tune the model with a batch size of 32, as recommended by the CodeXGLUE repository. However, we fine-tune the CodeT5 model with a batch size of 24 instead of 48 to fit the model into our Nvidia Titan RTX GPUs. We keep the other parameters unchanged. We also train our proposed GCBhybrid model with the cross-project data for Java (≈165k training samples) and Python (≈251k training samples). For same-project training, we replace the cross-project samples with same-project data and fine-tune the models using the same codebases used for fine-tuning the baselines.

During fine-tuning, the cross-project models stop improving after 10 epochs, but the same-project models continue improving even after 20 epochs, because of the smaller training set. Therefore, we fine-tune the same-project models for 30 epochs. Following prior work on code-related foundation models (Ahmad et al., 2021; Feng et al., 2020; Wang et al., 2021), we use smooth BLEU-4 (Lin and Och, 2004) as the evaluation metric.
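For evaluation, a sketch of a smoothed corpus-level BLEU-4 computation is shown below using NLTK; this is a stand-in, since the CodeXGLUE leaderboard uses its own smooth-BLEU script, so absolute numbers may differ slightly from those we report.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def smooth_bleu4(references, hypotheses):
    # One gold summary per sample; whitespace tokenization for simplicity.
    refs = [[r.split()] for r in references]
    hyps = [h.split() for h in hypotheses]
    chencherry = SmoothingFunction()
    return 100 * corpus_bleu(refs, hyps,
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=chencherry.method4)

# e.g., score a project's held-out test split:
# bleu = smooth_bleu4(gold_summaries, model_predictions)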

4.1. Effectiveness of same-project training on Category I projects

Table 5 presents the results of the Category I projects (with more than 150 training samples) for Java. We achieve 18.65, 18.83, 19.52, and 19.72 BLEU-4 on average with the GraphCodeBERT, GCBhybrid, CodeT5, and “PolyGlot” models, respectively, on the Java dataset in the cross-project fine-tuning setup; PolyGlot very slightly outperforms the other three models. In the same-project setup, the encoder-only GraphCodeBERT lags, because its untrained decoder has too few samples from which to learn. However, same-project GCBhybrid and CodeT5 perform really well, achieving 22.71 and 22.69 BLEU-4, respectively; the “PolyGlot” model excels, refining its already extensive multilingual fine-tuning to reach 25.87 BLEU-4. All models improve significantly over their cross-project counterparts (20.6% for GCBhybrid and 16.2% for CodeT5). Roy et al. (Roy et al., 2021) reported that differences of less than 2 BLEU points do not guarantee systematic improvements in summarization quality and are not trustworthy as proxies for human evaluation. In this category, our best model shows over 6 BLEU-4 improvement over cross-project CodeT5 and the cross-project “PolyGlot” model, three times the Roy et al. threshold. Hence, same-project training introduces systematic improvement in the code summarization task.

Projects | # Training / # Validation / # Test samples | Cross-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT | Same-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT
oblac/jodd 913 114 114 17.98 17.93 16.98 17.63 14.26 20.54 20.71 20.21
wildfly/wildfly 356 44 45 12.53 13.57 16.16 15.32 10.41 13.93 14.67 14.92
orientechnologies/orientdb 346 43 44 15.21 14.21 16.22 14.51 11.16 16.05 15.8 16.52
Unidata/thredds 1341 167 167 14.18 15.89 16.36 15.11 11.82 16.68 16.07 17.26
ngageoint/geopackage-android 239 16 16 24.19 22.91 32.24 21.42 15.92 40.72 38.77 34.95
RestComm/jain-slee 184 25 25 14.24 16.21 17.33 12.87 5.35 13.20 17.33 16.85
OpenEstate/OpenEstate-IO 196 10 10 21.89 15.39 18.7 17.99 12.44 13.28 3.01 22.23
tiefaces/TieFaces 281 14 15 36.74 37.17 32.37 38.99 25.53 61.74 64.02 72.12
jboss/jboss-common-core 209 17 17 17.16 22.26 17.33 29.06 11.01 17.42 21.83 29.04
rupertlssmith/lojix 336 41 42 12.36 12.8 11.54 14.32 9.72 13.5 14.65 14.62
Average 18.65 18.83 19.52 19.72 12.76 22.71 22.69 25.87
Table 5. Effectiveness of same-project fine-tuning for code summarization task on category I Java projects

Same-project training also works for Category I python projects. As per Table 6, same-project ”PolyGlot”, on average, beats the next best prior work (CodeT5) by a solid 5.7 BLEU-4. Note that CodeT5 and GCBhybrid also improve on same-project training, while GraphCodeBERT does not.

Projects | # Training / # Validation / # Test samples | Cross-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT | Same-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT
apache/airflow 435 53 54 16.16 16.41 17.11 18.14 10.33 16.39 17.96 19.46
tensorflow/probability 425 53 53 17.9 18.86 22.18 22.39 13.49 18.19 21.33 20.76
h2oai/h2o-3 215 26 27 13.72 15.03 16.29 14.17 10.69 14.31 16.25 14.59
Qiskit/qiskit-terra 376 43 43 23.13 22.81 24.27 19.93 17.65 21.24 24.13 25.65
chaoss/grimoirelab-perceval 188 24 24 14.59 11.69 14.79 11.12 26.26 38.7 36.82 50.27
PyCQA/pylint 271 33 34 17.91 17.89 18.91 19.27 12.89 17.23 20.12 18.36
SmokinCaterpillar/pypet 277 35 35 19.1 18.36 16.61 15.72 8.63 16.16 19.37 20.85
Average 17.50 17.29 18.59 17.25 14.28 20.32 22.28 24.28
Table 6. Effectiveness of same-project fine-tuning for code summarization task on category I Python projects
Observation 3. Further same-project fine-tuning with 150+ samples helps the “PolyGlot” model beat all other prior conventionally fine-tuned models by a substantial margin, on average.

4.2. Effectiveness of same-project training on Category II projects

Table 7 presents the results of the Category II projects (with 100-150 training samples) for Java. We measure 17.72, 17.84, 19.18, and 18.93 BLEU-4 on average with the GraphCodeBERT, GCBhybrid, CodeT5, and “PolyGlot” models, respectively, on the Java dataset in the cross-project setting. The performance in the cross-project setting is consistent with the results we observe for Category I projects. However, CodeT5 underperforms with same-project training and scores only 5.33 BLEU-4 on average; in Section 2, we found that CodeT5 generally performs poorly with fewer than 150 training samples. On the other hand, our decoder-enhanced model GCBhybrid outperforms all the cross-project models and achieves 21.34 BLEU-4, which is 2.16 higher than the best-performing cross-project model. The “PolyGlot” model surpasses even this, reaching 23.39 BLEU-4, over 4.2 BLEU-4 better than the best conventionally fine-tuned (cross-project) model. We find similar performance with Python as well (Table 8).

Observation 4. With 100-150 same-project samples, GCBhybrid outperforms all the cross-project models. Again, “PolyGlot” does best overall. However, CodeT5 underperforms in this sample range.
Projects | # Training / # Validation / # Test samples | Cross-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT | Same-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT
real-logic/aeron 128 16 16 20.54 19.58 18.54 17.84 11.4 23.22 13.54 27.07
boonproject/boon 123 14 15 20.14 21.83 25.16 22.00 16.94 20.7 1.87 25.72
Koekiebox-PTY-LTD/Fluid 100 13 13 24.46 21.83 22.51 21.37 20.1 39.23 10.89 31.6
lessthanoptimal/GeoRegression 120 15 16 16.91 13.18 16.35 17.72 8.75 14.17 3.64 19.00
tony19/logback-android 118 16 16 12.15 15.87 18.88 17.58 6.61 14.71 1.5 20.84
spring-projects/spring-security 132 12 13 12.12 14.73 13.64 17.04 5.53 16.01 0.54 16.12
Average 17.72 17.84 19.18 18.93 11.56 21.34 5.33 23.39
Table 7. Effectiveness of same-project fine-tuning for code summarization task on category II Java projects
Projects | # Training / # Validation / # Test samples | Cross-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT | Same-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT
Nic30/hwt 124 15 16 9.64 15.81 14.18 16.93 5.79 12.8 4.36 17.05
vaexio/vaex 124 15 16 15.9 17.23 15.62 17.08 9.2 13.6 2.2 18.19
assemblerflow/flowcraft 113 14 15 14.01 14.03 14.39 18.56 9.04 14.24 1.1 25.42
funilrys/PyFunceble 104 13 13 16.63 25.82 22.97 27.05 19.11 31.06 5.67 38.65
pyca/pyopenssl 100 13 13 19.81 27.47 23.69 23.1 15.22 24.87 17.24 25.4
LionelAuroux/pyrser 102 11 12 16.33 16.91 17.53 13.75 7.23 13.98 3.22 20.79
OpenKMIP/PyKMIP 150 18 18 14.57 15.66 17.03 18.8 31.38 39.11 41.17 42.59
Average 15.27 18.99 17.92 19.32 13.85 21.38 10.71 26.87
Table 8. Effectiveness of same-project fine-tuning for code summarization task on category II Python projects

4.3. Effectiveness of same-project training on Category III projects

Foundation models are pre-trained with billions of tokens. However, they are not pre-trained to do the code summarization task; they need enough samples to fine-tune for that objective. Though GCBhybrid scored 11.37 BLEU-4 (Table 1) with only 10 samples, this is much lower than what we usually achieve with cross-project models. Therefore, same-project training has some limitations: it requires a certain number of samples to achieve results comparable to cross-project models. From the Category II and Category III projects, we found that the models need at least 100 samples to compete with cross-project models. Table 9 has 4 projects in total (the first two from Python and the last two from Java). It shows that the same-project models underperform with fewer than 100 samples: GCBhybrid achieves only 14.99 BLEU-4 and CodeT5 only 2.84 BLEU-4, whereas all the cross-project models average at least 15.35 BLEU-4. However, even in this case, “PolyGlot” dominates after same-project fine-tuning, reaching 19.16; in this setting, though, the improvement falls slightly short of the 2 BLEU-4 threshold, so this approach may need to be combined with other refinements to gain stronger improvements.

Projects | # Training / # Validation / # Test samples | Cross-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT | Same-project: GraphCodeBERT, GCBhybrid, CodeT5, PolyGlot GraphCodeBERT
singularityhub/sregistry-cli 98 10 11 10.98 12.64 14.6 12.68 9.35 11.36 2.51 11.2
deepmipt/DeepPavlov 89 14 14 15.27 16.8 16.72 18.39 8.4 20.79 5.69 26.5
apache/parquet-mr 84 10 11 18.01 17.51 19.5 21.83 10.16 14.15 1.2 21.57
wro4j/wro4j 94 13 14 17.12 15.92 17.99 16.42 7.76 13.67 1.96 17.35
Average 15.35 15.72 17.20 17.33 8.92 14.99 2.84 19.16
Table 9. Effectiveness of same-project fine-tuning for code summarization task on category III projects
Observation 5. Same-project fine-tuning does help the “PolyGlot” model dominate again; however, with fewer than 100 samples it provides only modest gains by itself, and may have to be used together with other refinements.

5. Discussion

In this section, we will discuss the feasibility of applying same-project training and the computational cost needed for such training. We will also present one motivating example at the end of this section.

5.1. Feasibility of same-project training

Refer to caption
Figure 4. Complete lifespan and time needed to generate required number of samples.

So far, we have discussed the possible benefits of same-project training; higher BLEU-4 scores can be attained with just 100 samples. Our experiments suggest that even an already extensively fine-tuned model like Polyglot GraphCodeBERT can still benefit from a few epochs of same-project fine-tuning.

However, how long must we wait, after a project starts, to get 100 samples? This matters, because if it takes a long time to get enough samples, the benefit of same-project training is perhaps lower; one might just use cross-project training from older/existing projects. We look into the project-wise lifespans, and the time it takes to accumulate 100 fine-tuning samples, for our Category I and II projects. We track this data for both Python and Java, for each project in the CodeSearchNet data, from project inception to the date the data was collected. Many of these projects are still active.

Figure 4 shows the distribution of project time spans, in days, for both Java and Python projects. We show both the complete lifespan of the project (until the last activity, or the current time if still active) and the time elapsed until 100 fine-tuning samples are available. It is evident that Java projects “live” longer than Python projects in our dataset. The median lifespan for Java projects is 2872 days (almost 8 years), while the time required to generate 100 fine-tuning samples is 335 days (less than 1 year). That means for nearly 7 years (85% of the total lifespan), these projects could benefit from same-project fine-tuning. We observe a similar situation for Python: the median lifespan for Python projects is 1365 days (almost 4 years), and the time required to generate 100 comments is 496 days (less than 1.5 years). Python projects could also benefit from same-project training for 64% of their lifespan. Therefore, we can assume that sample sizes sufficient for same-project training become available reasonably early and can be used for the remaining lifespan of the project.
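Assuming each project's samples are already time-sorted (Section 3.1), the two per-project quantities summarized in Figure 4 can be computed with a sketch like the following.

def lifespan_stats(creation_dates, last_activity):
    # creation_dates: time-sorted datetimes of a project's fine-tuning samples;
    # last_activity: datetime of the project's last recorded activity (or now).
    start = creation_dates[0]
    lifespan_days = (last_activity - start).days
    days_to_100 = ((creation_dates[99] - start).days
                   if len(creation_dates) >= 100 else None)
    return lifespan_days, days_to_100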

5.2. Computational cost of same-project training

Same-project training is computationally much cheaper! In fact, in our experiments the same-project training run persists for three times as many epochs as the cross-project run (we do this to get better convergence with fewer samples); even so, we see significant savings in fine-tuning cost. In the cross-project setting, we have 164,923 and 251,820 samples for Java and Python, respectively. To compare with the same-project setting, we choose the projects with the largest in-project fine-tuning datasets: 1341 samples for Java, and 435 for Python. Even for these large projects, same-project training takes just 2.43% (Java) and 0.52% (Python) of the time of cross-project training. Considering all the Java and Python Category I and II projects, the total number of training samples is 5,122 and 3,004, respectively, across all projects. Cumulatively, over all projects, same-project training takes just 9.31% and 3.57% of the cross-project training time for Java and Python, respectively. Note that we compare computational efficiency with respect to sample counts instead of wall-clock time, because the time required to train on a given number of samples varies with server load.
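As a rough check on these figures (assuming, per Section 4, 10 epochs of cross-project fine-tuning versus 30 epochs of same-project fine-tuning), the relative cost reduces to a ratio of sample-epochs, $\frac{n_{\text{same}} \times 30}{n_{\text{cross}} \times 10}$; e.g., for Java $\frac{1341 \times 30}{164{,}923 \times 10} \approx 2.4\%$ and for Python $\frac{435 \times 30}{251{,}820 \times 10} \approx 0.5\%$, consistent with the percentages above.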

5.3. Motivating Example

Now we present one illustrative example where our same-project Polyglot GraphCodeBERT model outperforms all the other cross-project and same-project models. Table 10 presents the results from different models for the example presented in Figure 5. Cross-project GraphCodeBERT, GCBhybrid, and Polyglot GraphCodeBERT produce meaningful summaries for this example. Same-project GCBhybrid also generates a complete sentence (0.36 BLEU-4). However, Polyglot GraphCodeBERT fine-tuned with same-project data gives the closest summary, achieving 0.49 BLEU-4.

@Override
public void begin(InterpretationContext ic, String name,
        Attributes attributes) throws ActionException {
    hook = null;
    inError = false;
    String className = attributes.getValue(CLASS_ATTRIBUTE);
    if (OptionHelper.isEmpty(className)) {
        className = DefaultShutdownHook.class.getName();
        addInfo("Assuming className [" + className + "]");
    }
    try {
        addInfo("About to instantiate shutdown hook of type [" + className + "]");
        hook = (ShutdownHookBase) OptionHelper.instantiateByClassName(
                className, ShutdownHookBase.class, context);
        hook.setContext(context);
        ic.pushObject(hook);
    } catch (Exception e) {
        inError = true;
        addError("Could not create a shutdown hook of type [" + className + "].", e);
        throw new ActionException(e);
    }
}

Figure 5. Code for motivating example
Original summary: Instantiates a shutdown hook of the given class and sets its name .

Cross-project GraphCodeBERT: "Initialize the shutdown hook ." (BLEU-4 0.11)
Cross-project GCBhybrid: "Initialize the shutdown hook ." (0.11)
Cross-project CodeT5: "This method is called at the beginning of the action ." (0.13)
Cross-project PolyGlot GraphCodeBERT: "Parses a shutdown hook ." (0.14)
Same-project GraphCodeBERT: "Get the the the ." (0.07)
Same-project GCBhybrid: "Instantiate a hook of the given class ." (0.36)
Same-project CodeT5: ") { addInfo ( "Aboutto instantiate shutdown hook oftype [" + … more random tokens" (0.02)
Same-project PolyGlot GraphCodeBERT: "Instantiates a new shutdown hook of the given class ." (0.49)
Table 10. Predictions for the example presented in Fig. 5

6. Related Work

Code summarization Code summarization is a widely studied problem in software engineering. Developers spend around 59% of their time on activities somewhat relevant to program comprehension (Xia et al., 2017), and good comments can ease the development and maintenance process by helping developers more quickly understand the meaning of code under maintenance (Sridhara et al., 2010). However, misaligned and outdated comments are prevalent in SE projects. Automatic code summarization can help provide more faithful & current comments. Code summarization can also help write new comments.

We can closely relate the code summarization task to Neural Machine Translation (NMT) (e.g., translating English to German). In NMT, an encoder-decoder framework is used to do the translation task. Researchers in the SE domain have also adopted this framework for code summarization. Systems like CodeNN (Iyer et al., 2016), DeepCom (Hu et al., 2018), Astattgru (LeClair et al., 2019), Rencos (Zhang et al., 2020), NCS (Ahmad et al., 2020), and many more apply different kinds of deep learning architectures (e.g., LSTMs (Sutskever et al., 2014) and Transformers (Vaswani et al., 2017)) in an encoder-decoder framework and show good performance on the code summarization task. Prior work (Roy et al., 2021; Shia et al., 2022; Gros et al., 2020) discusses the evaluation metrics and datasets that have been used for the code summarization task.

Foundation models for code summarization Foundation models (Feng et al., 2020; Ahmad et al., 2021; Qi et al., 2021; Phan et al., 2021; Liu et al., 2019; Mastropaolo et al., 2021; Wang et al., 2021) are currently the state of the art for the code summarization task. Pre-training is key for foundation models, and helps them learn the language’s statistical properties well. Since the model already knows about the language, a few examples are enough to train the model for a downstream task like code summarization. In this paper, we show that pre-trained models are indeed sample-efficient and can outperform the models trained with cross-project data. Note that more than 30 papers following some form of encoder-decoder architecture for code summarization have been published in the last five years (Roy et al., 2021); comparing each model is beyond the scope of this paper. We primarily emphasize same-project, sample-efficient fine-tuning, which is applicable to a wide range of models. Ahmed and Devanbu (Ahmed and Devanbu, 2022) augment the fine-tuning dataset using multilingual training to help models perform better; in contrast, we propose to reduce the sample count and perform well using same-project data. Autoregressive generative language models such as GPT-3 (Brown et al., 2020) have shown strong on-task performance even after very limited fine-tuning; however, without custom pre-training (as was done for GraphCodeBERT and GCBhybrid), it is difficult to ensure that the enormous pre-training data used in these models was not already polluted with the data we use for same-project testing. Such enormous models are too costly to pre-train for all but the wealthiest organizations, so we omit them from our evaluation.

7. Threats

The main threats to our results arise from our evaluation approach. We explore some of them below.

Data Pollution

For external validity, and stability of results, it is important to ensure that we never test on data we used for pre-training or fine-tuning. The CodeXGLUE dataset for code summarization is split very carefully to avoid the risk of data pollution: the pre-training data is separate from the fine-tuning data, and the test data is distinct from both. Our evaluation of the GraphCodeBERT and GCBhybrid models adheres to this protocol. The “PolyGlot” model is first fine-tuned on a large, multilingual dataset of around a million samples from CodeXGLUE, before project-specific fine-tuning and same-project evaluation on a held-out set of projects.

Data Duplication

As described by Allamanis (Allamanis, 2019), duplication leads to poor estimates of performance that don’t generalize. Fortunately, CodeXGLUE (Lu et al., 2021) is very carefully de-duplicated, and thus the performance numbers we report here can be expected to be fairly reliable.

External Validity

External validity threats may arise from the size of the samples used to estimate performance, as well as the representativeness of those samples. First, our results have statistical significance: we compared the performance of our best model (the “PolyGlot” model) with the SOTA (CodeT5) using a paired, non-parametric two-sample test, and can reject the null hypothesis that the SOTA is the same as or better than our best model. Second, our average improvement in BLEU-4 score is well above the 2 BLEU-4 threshold reported to be the barrier for humans to notice a difference.
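A sketch of the kind of paired test we refer to is shown below, using SciPy over per-sample BLEU-4 scores; the exact configuration (one- vs. two-sided, tie handling) is our own assumption.

from scipy.stats import wilcoxon

def compare_models(bleu_best, bleu_sota, alpha=0.05):
    # Paired, one-sided Wilcoxon signed-rank test over per-sample BLEU-4 scores.
    stat, p = wilcoxon(bleu_best, bleu_sota, alternative="greater")
    mean_diff = sum(a - b for a, b in zip(bleu_best, bleu_sota)) / len(bleu_best)
    return p < alpha, p, mean_diff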

On the other hand, we have tested on a total of 18 Java projects and 16 Python projects. In almost every setting our best model beats CodeT5; but in some cases, it does not. Therefore, some caution is warranted in assessing the external validity of our results.

8. Conclusion

The existence, and impact, of project-specific phenomena in software projects has been known for quite a while. The advent of foundation models, which can be fine-tuned on-task, offers a possible direction to exploit project specificity for better on-task performance. We explore this direction, for the code summarization task, with several popular foundation models. We find that same-project training helps models exhibit competitive performance in several settings. In particular, we develop a new kind of GraphCodeBERT model, named GCBhybrid, which combines GraphCodeBERT with a specially trained decoder. GCBhybrid exhibits very high sample-efficiency, which further enables exploitation of project specificity; setting “PolyGlot” aside, GCBhybrid achieves state-of-the-art performance in some realistic same-project time-series settings. We also find that same-project training offers substantial savings in computational cost. Beyond code summarization, project-specific fine-tuning is a general idea that could well prove a useful adjunct for other tasks, such as defect prediction, fault localization, de-obfuscation, or automated patching. Finally, the same-project code summarization dataset and GCBhybrid source code are made available anonymously at https://doi.org/10.5281/zenodo.6523229.

References

  • Ahmad et al. (2021) Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2655–2668. https://www.aclweb.org/anthology/2021.naacl-main.211
  • Ahmad et al. (2020) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Ahmed and Devanbu (2022) Toufique Ahmed and Premkumar Devanbu. 2022. Multilingual training for Software Engineering. In Proceedings, ICSE.
  • Allamanis (2019) Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 143–153.
  • Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258 (2021).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Chen et al. (2022) Fuxiang Chen, Fatemeh Fard, David Lo, and Timofey Bryksin. 2022. On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages. arXiv preprint arXiv:2204.09653 (2022).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536–1547.
  • Gros et al. (2020) David Gros, Hariharan Sezhiyan, Prem Devanbu, and Zhou Yu. 2020. Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 746–757.
  • Guo et al. (2020) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, LIU Shujie, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference on Learning Representations.
  • Hellendoorn and Devanbu (2017) Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 763–773.
  • Hindle et al. (2012) Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE).
  • Hu et al. (2018) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC). IEEE, 200–210.
  • Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
  • Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
  • Kanade et al. (2020) Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning. PMLR, 5110–5121.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526.
  • LeClair et al. (2019) Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for generating natural language summaries of program subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 795–806.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Orange: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. 501–507.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. CoRR abs/2102.04664 (2021).
  • Mastropaolo et al. (2021) Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 336–347.
  • Phan et al. (2021) Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. 2021. CoTexT: Multi-task Learning with Code-Text Transformer. arXiv preprint arXiv:2105.08645 (2021).
  • Qi et al. (2021) Weizhen Qi, Yeyun Gong, Yu Yan, Can Xu, Bolun Yao, Bartuer Zhou, Biao Cheng, Daxin Jiang, Jiusheng Chen, Ruofei Zhang, et al. 2021. ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. arXiv preprint arXiv:2104.08006 (2021).
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
  • Roy et al. (2021) Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1105–1116.
  • Shi et al. (2022) Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. 2022. On the Evaluation of Neural Code Summarization. In Proceedings, ICSE.
  • Sridhara et al. (2010) Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. 43–52.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
  • Tu et al. (2014) Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. 2014. On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 269–280.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708.
  • Xia et al. (2017) Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering 44, 10 (2017), 951–976.
  • Zhang et al. (2020) Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code summarization. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE, 1385–1397.
  • Zimmermann et al. (2009) Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, and Brendan Murphy. 2009. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. 91–100.