
Code Comment Inconsistency Detection with BERT and Longformer

Theo Steiner
UNC Chapel Hill
[email protected]
Rui Zhang
Penn State University
[email protected]
Abstract

Comments, or natural language descriptions of source code, are standard practice among software developers. By communicating important aspects of the code such as functionality and usage, comments help with software project maintenance. However, when the code is modified without an accompanying correction to the comment, an inconsistency between the comment and code can arise, which opens up the possibility for developer confusion and bugs. In this paper, we propose two models based on BERT (Devlin et al., 2019) and Longformer (Beltagy et al., 2020) to detect such inconsistencies in a natural language inference (NLI) context. Through an evaluation on a previously established corpus of comment-method pairs both during and after code changes, we demonstrate that our models outperform multiple baselines and yield comparable results to the state-of-the-art models that exclude linguistic and lexical features. We further discuss ideas for future research in using pretrained language models for both inconsistency detection and automatic comment updating.

1 Introduction

Natural language comments embedded within source code are indispensable to successful software projects. Although they are not compiled, comments serve as a guide to developers by providing valuable information about the functionality the source code implements as well as how developers are meant to utilize the code (Tan et al., 2012). Comments additionally help reduce ramping-up time, which is the time required for new developers to get accustomed to the specifics of a software project. Moreover, since developers on average spend around 82% of their time on program comprehension and navigation (Xia et al., 2018), maintaining consistent comments can increase developer productivity. As such, it is essential for developers to keep comments up to date with the code so that their benefits can be capitalized on. In practice, however, comments are not always updated alongside code changes (Wen et al., 2019), which can lead to them becoming outdated and inconsistent with the code. Figure 1 shows an example of an inconsistent comment versus a consistent comment. If not addressed in a timely manner, these inconsistencies can linger in code bases and mislead developers, becoming sources of confusion, bugs, and an overall impaired software development cycle (Wen et al., 2019; Jiang and Hassan, 2006; Tan et al., 2007; Ibrahim et al., 2012). Code comment inconsistency detection is therefore of immense practical use to software developers who have a vested interest in keeping their code bases easily readable, navigable, and as bug-free as possible.

Figure 1: (a) Inconsistent and (b) consistent Javadoc comments after each code method is modified (Panthaplackel et al., 2021). Removed tokens are in red and added tokens are in green.

Prior work focuses on rule-based techniques for specific types of comments such as references to renamed identifiers (Ratol and Robillard, 2017), API directives pertaining to parameter constraints (Zhou et al., 2017), and null values and exceptions (Tan et al., 2012). Others have attempted text similarity methods (Rabbi and Siddik, 2020; Corazza et al., 2018). However, these previous approaches only apply to pre-existing inconsistencies within a software project, which is not helpful for developers who would benefit more from a comprehensive, real-time comment maintenance system.

To resolve this dilemma, Panthaplackel et al. (2021) develop a state-of-the-art deep learning approach that not only detects existing inconsistencies (referred to as post hoc) but also flags inconsistencies that appear immediately as a result of code changes, before they can cause harm to the code base (referred to as just-in-time). A comment update model, designed to aid developers in fixing discrepancies by suggesting updates to those comments that are labeled as inconsistent, is also presented within an extrinsic evaluation (Panthaplackel et al., 2021).

In this paper, we alternatively propose a much more straightforward architecture and methodology. Our approach leverages existing large, pretrained models, namely BERT (Devlin et al., 2019) and Longformer (Beltagy et al., 2020), to address the problem of code comment inconsistency detection as a natural language inference (NLI) task (our implementation is available at https://github.com/theo2023/coco-bert-longformer). Such models have previously demonstrated high performance on many downstream natural language processing (NLP) tasks, including NLI. To compare with the state-of-the-art, we evaluate our models on the corpus provided by Panthaplackel et al. (2021) and show that our approach outperforms the baselines by substantial margins in the post hoc setting but underperforms the state-of-the-art in the just-in-time setting.

Our main contributions are that we (1) formulate code comment inconsistency detection as an NLI problem via binary classification; (2) show that large, pretrained language models can provide significant improvements in performance in the post hoc setting and comparable results to the state-of-the-art (without the addition of explicit features) in the just-in-time setting; and (3) offer proposals for future work in inconsistency detection and comment updating based on our approach.

2 Related Work

Rabbi and Siddik (2020) introduce a siamese recurrent neural network architecture that detects inconsistencies by measuring the similarity between the comment and the code. They employ two separate LSTM (Hochreiter and Schmidhuber, 1997) models, one for the comment and the other for the code, and compute the Euclidean norm distance between the final hidden vectors of each LSTM. If the similarity score is below a certain predefined threshold, the comment is classified as inconsistent. While this technique achieves high recall on a limited set of Java code-comment pairs, it does not consider changes between different versions of the code and thus is incompatible with the just-in-time setting. Additionally, in general, text similarity is only a relevant metric for comments that summarize their code methods, not comments that document more specific aspects of the code (e.g., @return and @param Javadoc comments) (Panthaplackel et al., 2021).

Liu et al. (2018) develop several random forest classifiers and use 64 features relating to the code before and after modification, the comment, and their relationship. The features include code refactorings, length of the comment, and code-comment similarity, among others, in order to detect inconsistent block and line comments just-in-time. We avoid such extensive feature engineering in our procedure and also consider Javadoc comments instead of the more low-level block/line comments.

Instead of a model based on code-comment similarity or thorough feature extraction, Panthaplackel et al. (2021) learn the relationship between comments and code modifications through a more intricate architecture that features several BiGRU (Cho et al., 2014) encoders and multi-head attention (Vaswani et al., 2017) layers. This architecture is explained in more detail in Section 5.1. Rather than using external attention modules, we rely on the built-in attention mechanisms of BERT and Longformer to correlate the comment and code tokens.

Automatic comment updating has also been of interest in the recent literature. This is the more challenging problem of resolving inconsistencies once they have been detected. Both Panthaplackel et al. (2020) and Zhu et al. (2022) develop an encoder-decoder architecture that examines changes in the source code and generates an edit action sequence specifying how the comment should be updated, which is then parsed into the gold comment. Although we focus on inconsistency detection and do not provide a comment update model in this work, we encourage further research in this important area in Section 8.

3 Task

The task of code comment inconsistency detection is presented as follows: given a comment $C$ and its corresponding code method $M$, determine whether $C$ is inconsistent with $M$. We accomplish this task in two settings, post hoc and just-in-time, and formulate it in terms of NLI.

3.1 Post hoc

In the post hoc setting, only the existing version of the comment-method pair is available such that inconsistencies are already present in the code base. Modifications to the code are not considered.

3.2 Just-in-time

In the just-in-time setting, the idea is that modifications to the code must be analyzed in order to determine which code changes caused the comment to become inconsistent, and to detect this inconsistency before it materializes and is committed to the repository. Thus, both versions of the code method $M_{old}$ and $M$ (before and after modification by the developer) are available, as well as $M_{edit}$, an edit action sequence representing the changes between $M_{old}$ and $M$. An example of $M_{edit}$ is shown in Figure 2.

Figure 2: Sequence-based representation of $M_{edit}$ (Panthaplackel et al., 2021). Removed tokens are in red and added tokens are in green.
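
For intuition, the following is a minimal sketch of how two method versions could be linearized into an edit action sequence using Python's difflib. The <KEEP>/<DELETE>/<INSERT> markers and the granularity are illustrative assumptions; in our experiments we use the span diff subtokens released with the dataset (Section 4) rather than computing them ourselves.

import difflib

def edit_sequence(m_old_tokens, m_new_tokens):
    """Linearize the changes between two token sequences into edit actions.

    Illustrative only: the markers and granularity here are assumptions,
    not the exact span-diff format of the released dataset.
    """
    actions = []
    matcher = difflib.SequenceMatcher(a=m_old_tokens, b=m_new_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            actions += ["<KEEP>"] + m_old_tokens[i1:i2]
        elif tag == "delete":
            actions += ["<DELETE>"] + m_old_tokens[i1:i2]
        elif tag == "insert":
            actions += ["<INSERT>"] + m_new_tokens[j1:j2]
        else:  # "replace": old tokens removed, new tokens added
            actions += ["<DELETE>"] + m_old_tokens[i1:i2]
            actions += ["<INSERT>"] + m_new_tokens[j1:j2]
    return actions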

3.3 Natural Language Inference

We view code comment inconsistency detection as an NLI task, also known as textual entailment. Given a premise and a hypothesis, the objective of this common NLP task is to determine whether the premise entails the hypothesis, contradicts it, or is neutral with respect to it (see https://paperswithcode.com/task/natural-language-inference). For this specific problem, the premise is the comment $C$ and the hypothesis is the code method $M$. If our models predict entailment between $M$ and $C$, then $C$ is labeled as consistent; if they predict contradiction, then $C$ is labeled as inconsistent. Note that we slightly modify NLI and drop the neutral label entirely, simplifying inconsistency detection to binary classification. Also note that in our initial problem formulation, the premise was $M$ and the hypothesis was $C$, but for practical reasons described in Section 5.2, we swap the order of $M$ and $C$.

4 Data

We use the training, validation, and test data curated by Panthaplackel et al. (2021), available at https://github.com/panthap2/deep-jit-inconsistency-detection. This data consists of @return, @param, and summary Javadoc comments paired with their corresponding Java methods, totaling 40,688 examples (see Table 1). They consider comment-method pairs $(C_1, M_1), (C_2, M_2)$ from consecutive commits of popular open-source Java projects and set $C = C_1$, $M_{old} = M_1$, and $M = M_2$. Only examples in which $M_1 \neq M_2$, i.e., where the developer modified the code method, are extracted. The data further includes a sequence of span diff subtokens for each method that we use as $M_{edit}$ in the just-in-time setting.

            Train    Valid    Test     Total
@return    15,950    1,790    1,840    19,580
@param      8,640      932    1,038    10,610
Summary     8,398    1,034    1,066    10,498
Full       32,988    3,756    3,944    40,688
Projects      829      332      357     1,518
Table 1: Statistics of the full dataset (Panthaplackel et al., 2021).

Collected examples are given a positive label, i.e., $C_1$ is inconsistent with $M_2$, if $C_1 \neq C_2$. The assumption behind this decision is that the developer updated the comment when they modified the code because the comment would have become inconsistent otherwise. On the other hand, collected examples are given a negative label, i.e., $C_1$ is consistent with $M_2$, if $C_1 = C_2$. The assumption behind this decision is that the developer did not need to update the comment when they modified the code because the comment remained consistent (Panthaplackel et al., 2021).
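
Concretely, the labeling rule can be sketched as follows (a minimal illustration of the procedure described above; the variable names are hypothetical):

def label_example(c1, m1, c2, m2):
    """Label a comment-method pair from consecutive commits.

    Examples are only collected when the method changed; the comment is
    labeled inconsistent (1) if the developer also edited it, and
    consistent (0) if it was left untouched.
    """
    assert m1 != m2, "only examples with modified methods are extracted"
    return 1 if c1 != c2 else 0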

However, this data collection procedure introduces noise in the form of mislabeled examples (Panthaplackel et al., 2021). A positive mislabeling occurs if the developer simply improved the comment without changing its semantics; hence, $C_1 \neq C_2$ even though the comment is actually consistent. A negative mislabeling occurs if the developer neglected to update the comment alongside the code changes; hence, $C_1 = C_2$ even though the comment is actually inconsistent. To reduce this noise, Panthaplackel et al. (2021) only consider the 1,000 best maintained Java projects and curate a smaller clean test sample separate from the full test set. As these clean examples were not provided, however, we do not report results on that data and only evaluate on the full test set.

Lastly, comment, method, and edit action sequences are tokenized via the subword tokenization algorithms of BERT and Longformer, i.e., WordPiece (Schuster and Nakajima, 2012) and byte pair encoding (Sennrich et al., 2016) respectively.
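
As a minimal sketch, this subword tokenization can be performed with the Hugging Face tokenizers of the two models; the checkpoint names below are assumptions, since we only specify the base BERT and Longformer models.

from transformers import BertTokenizerFast, LongformerTokenizerFast

# Assumed checkpoints for the base models.
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
longformer_tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

comment = "@return the number of elements currently stored in the buffer"
print(bert_tokenizer.tokenize(comment))        # WordPiece subtokens
print(longformer_tokenizer.tokenize(comment))  # byte pair encoding subtokens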

5 Models

We first discuss the baselines then introduce our models.

5.1 Baselines

  • Corazza et al. (2018): This is a post hoc SVM approach with a TF-IDF schema that learns the decision boundary between consistent and inconsistent comment-method pairs.

  • CodeBERT BOW: As implemented by Panthaplackel et al. (2021), this bag-of-words baseline is the same underlying model for each setting. It is an application of CodeBERT (Feng et al., 2020) in which the average embedding vectors of $C$ and $M$ (or $M_{edit}$) are concatenated and sent through a feedforward neural network.

  • Liu et al. (2018): Originally involving block and line comments that accompany code snippets, this is a just-in-time approach re-implemented by Panthaplackel et al. (2021). It uses random forest classifiers with code, comment, and relationship features to predict inconsistent comments based on code changes.

  • Panthaplackel et al. (2021): There are nine total models. In the architecture, the code encoder (Figure 3 (3)) is one of two distinct configurations. The first is a sequence encoder, a BiGRU that encodes $M_{edit}$ (or $M$ in the post hoc setting). The $\textsc{Seq}(C, *)$ models correspond to the use of this encoder.

    Figure 3: Panthaplackel et al. (2021) architecture.
    Figure 4: AST-based representation of $M_{edit}$ ($T_{edit}$) (Panthaplackel et al., 2021). Removed nodes are in red and added nodes are in green.

    The second configuration is an abstract syntax tree (AST) encoder, a gated graph neural network (GGNN) (Li et al., 2016) that encodes $T_{edit}$ (or $T$ in the post hoc setting), which is an AST representation of $M_{edit}$. Figures 2 and 4 show examples of $M_{edit}$ and $T_{edit}$ respectively. The $\textsc{Graph}(C, *)$ models correspond to the use of this encoder. The last three models, denoted $\textsc{Hybrid}(C, *, *)$, incorporate both the BiGRU and the GGNN when encoding the code method.

    Once the code and comment encoders encode their input as a sequence of tokens, integrating bidirectional context of other input tokens in the process, multi-head self-attention (Figure 3 (2)) and multi-head cross-attention (Figure 3 (4)) are each computed. The former updates the comment representation to include more context and help capture long-range dependencies, while the latter is employed between each hidden state of the comment encoder and the hidden states of the code encoder to associate the comment with changes in the code. Then, the contextualized embeddings from each attention module are combined and encoded by another BiGRU (Figure 3 (5)), at which point the comment encoding carries context from both the code method and other tokens in the comment. Finally, this output is fed through a fully connected layer and a softmax operation (Figure 3 (6)) to predict whether the comment is inconsistent with the code method (Figure 3 (7)).

5.2 Our Models

Our first model is BERT (Devlin et al., 2019) finetuned on an NLI task. In other words, it is the base BERT model with an attached sequence classification head. Since BERT allows a maximum sequence length of 512 tokens, the concatenated sequences are truncated if they exceed that length. This technique is not ideal because potentially valuable information from tokens in $M$ (or $M_{edit}$) that the classifier may need to make a correct prediction is lost.

To better accommodate longer sequences, we propose our second model, Longformer (Beltagy et al., 2020) finetuned on an NLI task. In other words, it is the base Longformer model with an attached sequence classification head. The eightfold increase in maximum sequence length (from 512 to 4096) provided by Longformer helps to reduce the adverse effects of truncation that our BERT model encounters. However, the ability to process every single comment-method pair in the dataset does not justify the time and memory costs of using the default Longformer maximum sequence length of 4096. The statistics in Table 2 therefore motivate a choice of 1024 as the custom maximum sequence length for our Longformer model. As in the first model, the concatenated sequences are truncated if they exceed this length, with some loss of performance as the tradeoff.

The major advantage of Longformer is its linear attention pattern, addressing the fact that the time and memory complexity of vanilla self-attention (Vaswani et al., 2017) scales quadratically with sequence length (Beltagy et al., 2020). This improved attention mechanism makes Longformer especially efficient for longer sequences.
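
A minimal sketch of instantiating both classifiers is given below. The checkpoint names are assumptions, and the single-logit head scored with binary cross entropy (Section 6.2) is one way to realize the binary classification head; the 1024-token limit for Longformer is enforced at tokenization time, since the pretrained position embeddings already cover sequences up to 4096 tokens.

from transformers import (
    BertForSequenceClassification,
    LongformerForSequenceClassification,
)

# Assumed checkpoints; only the base BERT and Longformer models are specified.
# num_labels=1 yields a single logit scored with binary cross entropy (Section 6.2).
bert_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)
longformer_model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=1
)

MAX_LEN_BERT = 512         # BERT's maximum sequence length
MAX_LEN_LONGFORMER = 1024  # custom limit motivated by Table 2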

             @return   @param   Summary    Full
$C$             34.0     32.0      35.0    34.0
$M_{old}$      728.0    936.8     753.1   797.0
$M$            726.0    942.8     750.1   790.0
$M_{edit}$     895.1   1102.6     955.3   981.0
Table 2: 98th percentile of the comment and code sequence lengths.

             @return   @param   Summary    Full
$C$              9.7      8.4      13.3    10.3
$M_{old}$      131.1    186.9     137.0   147.2
$M$            131.9    187.7     135.4   147.3
$M_{edit}$     179.4    240.9     186.6   197.3
Table 3: Average lengths of the comment and code sequences (Panthaplackel et al., 2021).

As input to our models, the order $(C, M)$ was chosen: preliminary experiments revealed that $(M, C)$ did not lead to a significant change in results, but with the $(M, C)$ ordering there were cases in which all the tokens in $C$ were truncated. Placing the comment first aligns with the fact that comments are shorter than their corresponding code methods on average (see Table 3), so in the worst case, the end of $M$ is truncated rather than the entirety of $C$.

6 Experiments

Model                         Precision   Recall     F1   Accuracy
Corazza et al. (2018)              63.7     47.8   54.6       60.3
CodeBERT BOW$(C, M)$               68.9     73.2   70.7       69.8
$\textsc{Seq}(C, M)$               60.6     73.4   66.3       62.8
$\textsc{Graph}(C, T)$             62.6     72.6   67.2       64.6
$\textsc{Hybrid}(C, M, T)$         56.3     80.8   66.3       58.9
$\textsc{Bert}(C, M)$              72.1     71.9   72.0       72.1
$\textsc{Longformer}(C, M)$        92.7     81.0   86.4       87.3
Table 4: Results for post hoc models compared with baselines. Scores are averaged across three different seeds.

Model                                                 Precision   Recall     F1   Accuracy
CodeBERT BOW$(C, M_{edit})$                                67.4     76.8   71.6       69.6
Liu et al. (2018)                                          77.5     63.8   70.0       72.6
$\textsc{Seq}(C, M_{edit})$                                80.7     73.8   77.1       78.0
$\textsc{Graph}(C, T_{edit})$                              79.8     74.4   76.9       77.6
$\textsc{Hybrid}(C, M_{edit}, T_{edit})$                   80.9     74.7   77.7       78.5
$\textsc{Seq}(C, M_{edit})$ + features                     88.4     73.2   80.0       81.8
$\textsc{Graph}(C, T_{edit})$ + features                   83.8     78.3   80.9       81.5
$\textsc{Hybrid}(C, M_{edit}, T_{edit})$ + features        88.6     72.4   79.6       81.5
$\textsc{Bert}(C, M_{edit})$                               76.4     78.3   77.3       77.1
$\textsc{Longformer}(C, M_{edit})$                         80.3     76.5   78.3       78.8
Table 5: Results for just-in-time models compared with baselines. Scores are averaged across three different seeds.

6.1 Implementation Details

$C$ and $M$ (or $M_{edit}$) are tokenized separately and concatenated into one sequence of the form

[CLS] $C$ tokens [SEP] $M$ tokens [SEP],

where [CLS] is the classification token and [SEP] is the separation token. An attention mask is created as well as a binary mask representing the token type IDs so that BERT can differentiate between the part of the sequence that corresponds to $C$ and the part that corresponds to $M$ (or $M_{edit}$). For the Longformer model, the token type IDs are unnecessary due to the use of the RoBERTa (Liu et al., 2019) tokenizer. As described in Section 5.2, the full sequence is truncated if it is longer than the maximum length permitted by the model. All other configurations, such as pretrained comment, code, and code edit embeddings, are the BERT and Longformer defaults.
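
A minimal sketch of this input construction with the BERT tokenizer is shown below. The only_second truncation strategy is one way to realize the behavior described in Section 5.2, where the end of $M$ is truncated rather than the comment; the checkpoint name is an assumption.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # assumed checkpoint

def encode_pair(comment, code, max_length=512):
    # Produces: [CLS] comment tokens [SEP] code tokens [SEP],
    # plus attention_mask and token_type_ids (0 for the comment segment,
    # 1 for the code segment). Tokens are dropped from the end of the code
    # method if the pair exceeds max_length.
    return tokenizer(
        comment,
        code,
        truncation="only_second",
        max_length=max_length,
        padding="max_length",
        return_token_type_ids=True,
        return_tensors="pt",
    )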

6.2 Training

Because we are essentially training a binary classifier, we employ binary cross entropy (BCE) (Zhang and Sabuncu, 2018) as our loss function. We also use the Adam optimizer (Kingma and Ba, 2014) and consider {0.0001, 0.00001} as learning rates. To address memory issues, gradient accumulation is employed to achieve effective batch sizes of {16, 32, 64}. Training terminates if the validation F1 score does not improve for 10 epochs.
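
The training procedure can be sketched as follows, assuming the single-logit head from the sketch in Section 5.2 and pre-built data loaders; the loop structure and the names model, train_loader, valid_loader, and max_epochs are illustrative, not the released implementation.

import torch
from sklearn.metrics import f1_score

# `model`, `train_loader`, `valid_loader`, and `max_epochs` are assumed to exist.
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # or 1e-4
accumulation_steps = 4  # e.g., per-device batch size 8 -> effective batch size 32

best_f1, patience, stale_epochs = 0.0, 10, 0
for epoch in range(max_epochs):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits.squeeze(-1)
        loss = criterion(logits, batch["labels"].float()) / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    # Early stopping on validation F1.
    model.eval()
    preds, golds = [], []
    with torch.no_grad():
        for batch in valid_loader:
            logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits.squeeze(-1)
            preds += (torch.sigmoid(logits) > 0.5).long().tolist()
            golds += batch["labels"].tolist()
    val_f1 = f1_score(golds, preds)
    if val_f1 > best_f1:
        best_f1, stale_epochs = val_f1, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break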

6.3 Software and Hardware

We implemented all models using PyTorch and the Hugging Face Transformers library (https://huggingface.co/docs/transformers/index), and we used scikit-learn to compute the evaluation metrics precision, recall, F1 score, and accuracy. Models were trained in a distributed fashion on 8 NVIDIA RTX A5000 GPUs (24 GB each).
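
For reference, the four metrics can be computed as below (a small sketch; the variable names are illustrative):

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(golds, preds):
    """Compute the four reported metrics as percentages."""
    return {
        "precision": 100 * precision_score(golds, preds),
        "recall": 100 * recall_score(golds, preds),
        "f1": 100 * f1_score(golds, preds),
        "accuracy": 100 * accuracy_score(golds, preds),
    }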

7 Results and Analysis

We report precision, recall, F1, and accuracy for our post hoc and just-in-time models in Tables 4 and 5 respectively. In the post hoc setting, both of our models outperform the baselines with respect to precision, F1, and accuracy, and $\textsc{Longformer}(C, M)$ achieves higher scores across all metrics. In the just-in-time setting, the recall of 78.3 achieved by $\textsc{Bert}(C, M_{edit})$ matches the highest (from the baseline $\textsc{Graph}(C, T_{edit})$ + features); otherwise, our just-in-time models overall underperform the Panthaplackel et al. (2021) models that add surface features such as part-of-speech tags and comment/code overlap. However, our models produce results comparable to the Panthaplackel et al. (2021) models without these features.

The results in Table 4 indicate that BERT and Longformer are clearly advantageous for post hoc inconsistency detection when finetuned on an NLI task. The substantial performance gains attained by our Longformer model are foreseeable given its linear attention pattern and ability to adequately process longer sequences. However, the decrease in performance from $\textsc{Longformer}(C, M)$ to $\textsc{Longformer}(C, M_{edit})$ in the just-in-time setting (Table 5) is puzzling, especially considering the increase in performance from $\textsc{Bert}(C, M)$ to $\textsc{Bert}(C, M_{edit})$. The reason for this performance drop is unclear.

8 Future Work

Given these results, we leave it to future work to both improve code comment inconsistency detection as an NLI task and apply our approach to a comment update model, as mentioned in Sections 1 and 2. At a high level, in addition to the just-in-time system flagging the inconsistent comment and notifying the developer to manually correct it, the code changes that caused the inconsistency would be examined and suggestions would be generated to help the developer rectify the outdated comment.

Certain variants of these pretrained models also show promise. For instance, GraphCodeBERT (Guo et al., 2021) is specifically designed for processing programming languages and is pretrained on a large corpus of comment-code pairs. The main contribution of GraphCodeBERT is that it takes into account the semantic structure of code through its use of data flow. Furthermore, Longformer Encoder-Decoder (Beltagy et al., 2020) is a possibility for a comment update model since it is intended for sequence-to-sequence tasks with longer input.

Another option as an implementation detail is to handle longer sequences via chunking, i.e., splitting the input sequence into smaller chunks and processing them separately. We did not pursue this method due to time constraints, but we believe it is worthwhile to explore.
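
One possible realization of this idea is sketched below; the window size, stride, and max-probability aggregation are assumptions of our own choosing, not from any prior implementation.

def chunk_tokens(tokens, chunk_size=384, stride=256):
    """Split a long token sequence into overlapping chunks."""
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return chunks

# Each chunk would then be paired with the comment, scored by the classifier,
# and the per-chunk probabilities aggregated, e.g., by taking the maximum
# (an inconsistency signaled by any chunk marks the comment as inconsistent).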

9 Conclusion

We formulated code comment inconsistency detection as a natural language inference task and proposed two models based on BERT and Longformer. By evaluating on a known corpus of comment-method pairs in both a post hoc and just-in-time manner, we demonstrated that our models outperform several baselines in the post hoc setting and yield comparable results to the state-of-the-art in the just-in-time setting without additional linguistic and lexical features. We further discussed some ideas for future research in using pretrained language models for inconsistency detection and an automatic comment updating system.

Acknowledgements

We thank the 2022 Machine Learning in Cybersecurity Research Experience for Undergraduates (REU) program at Penn State University, funded by National Science Foundation Grant 1950491, for supporting this work.

References

  • Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
  • Corazza et al. (2018) Anna Corazza, Valerio Maggio, and Giuseppe Scanniello. 2018. Coherence of comments and method implementations: a dataset and an empirical investigation. Software Quality Journal, 26(2):751–777.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.
  • Guo et al. (2021) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie LIU, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training code representations with data flow. In International Conference on Learning Representations.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
  • Ibrahim et al. (2012) Walid M. Ibrahim, Nicolas Bettenburg, Bram Adams, and Ahmed E. Hassan. 2012. On the relationship between comment update practices and software bugs. J. Syst. Softw., 85(10):2293–2304.
  • Jiang and Hassan (2006) Zhen Ming Jiang and Ahmed E. Hassan. 2006. Examining the evolution of code comments in postgresql. In Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR ’06, page 179–180, New York, NY, USA. Association for Computing Machinery.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated graph sequence neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2018) Zhiyong Liu, Huanchao Chen, Xiangping Chen, Xiaonan Luo, and Fan Zhou. 2018. Automatic detection of outdated comments during code changes. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), volume 01, pages 154–163.
  • Panthaplackel et al. (2021) Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, and Raymond J. Mooney. 2021. Deep just-in-time inconsistency detection between comments and source code. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Panthaplackel et al. (2020) Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, and Raymond Mooney. 2020. Learning to update natural language comments based on code changes. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1853–1868, Online. Association for Computational Linguistics.
  • Rabbi and Siddik (2020) Fazle Rabbi and Md Saeed Siddik. 2020. Detecting Code Comment Inconsistency Using Siamese Recurrent Network, page 371–375. Association for Computing Machinery, New York, NY, USA.
  • Ratol and Robillard (2017) Inderjot Kaur Ratol and Martin P. Robillard. 2017. Detecting fragile comments. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 112–122.
  • Schuster and Nakajima (2012) Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Tan et al. (2007) Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. 2007. /*icomment: Bugs or bad comments?*/. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07, page 145–158, New York, NY, USA. Association for Computing Machinery.
  • Tan et al. (2012) Shin Hwei Tan, Darko Marinov, Lin Tan, and Gary T. Leavens. 2012. @tcomment: Testing javadoc comments to detect comment-code inconsistencies. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, pages 260–269.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Wen et al. (2019) Fengcai Wen, Csaba Nagy, Gabriele Bavota, and Michele Lanza. 2019. A large-scale empirical study on code-comment inconsistencies. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 53–64.
  • Xia et al. (2018) Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E. Hassan, and Shanping Li. 2018. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering, 44(10):951–976.
  • Zhang and Sabuncu (2018) Zhilu Zhang and Mert Sabuncu. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31.
  • Zhou et al. (2017) Yu Zhou, Ruihang Gu, Taolue Chen, Zhiqiu Huang, Sebastiano Panichella, and Harald Gall. 2017. Analyzing apis documentation and code to detect directive defects. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pages 27–37.
  • Zhu et al. (2022) Hongquan Zhu, Xincheng He, and Lei Xu. 2022. Hatcup: Hybrid analysis and attention based just-in-time comment updating. arXiv preprint arXiv:2205.00600.