
Automatic Code Documentation Generation Using GPT-3

Junaed Younus Khan and Gias Uddin DISA Lab, University of Calgary
(2022)
Abstract.

Source code documentation is an important artifact for efficient software development. Code documentation could greatly benefit from automation since manual documentation is often laborious, resource-intensive, and time-consuming. In this paper, we employed Codex for automatic code documentation creation. Codex is a GPT-3 based model pre-trained on both natural and programming languages. We find that Codex outperforms existing techniques even with basic settings like one-shot learning (i.e., providing only one training example). Codex achieves an overall BLEU score of 20.6 across six different programming languages (an 11.2% improvement over earlier state-of-the-art techniques). Thus, Codex shows promise and warrants in-depth future studies on automatic code documentation generation to support diverse development tasks.

code documentation, GPT-3, Machine Learning.
journalyear: 2022; copyright: acmcopyright; conference: 37th IEEE/ACM International Conference on Automated Software Engineering, October 10–14, 2022, Rochester, MI, USA; booktitle: 37th IEEE/ACM International Conference on Automated Software Engineering (ASE ’22), October 10–14, 2022, Rochester, MI, USA; price: 15.00; doi: 10.1145/3551349.3559548; isbn: 978-1-4503-9475-8/22/10; ccs: Software and its engineering → Documentation; ccs: Computing methodologies → Natural language generation

1. Introduction

In Software Engineering (SE), developers often try to figure out what a specific code unit (e.g., a method) does and how to use it (Xia et al., 2017). They can do this by reading the documentation of the source code. Well-written documentation is crucial for effective software development (de Souza et al., 2005). The sudden shift to working from home during COVID-19 further highlighted the need for good documentation, as the subject matter expert of a code base may be absent or unavailable (Uddin et al., 2022). However, such documentation is costly and time-consuming to create and maintain. Developers are often reluctant to write documentation because they find it less productive and less rewarding (Parnas, 2011; McBurney et al., 2017). As a result, manual documentation often becomes problematic or unusable (Robillard, 2009; Aghajani et al., 2019, 2020; Uddin et al., 2021). Moreover, documentation becomes obsolete over time as the system (i.e., the code base) is continuously modified or updated (Ibrahim et al., 2012; Uddin and Robillard, 2015; Uddin et al., 2012).

Automated code documentation generation is currently attracting a lot of attention from the research community as a way to substitute or complement manual documentation efforts. Earlier works in this direction mostly focused on template-based (Sridhara et al., 2010; McBurney and McMillan, 2014; Moreno et al., 2013; Rai et al., 2017; Abid et al., 2015; Rastkar et al., 2011) and information retrieval (IR)-based strategies (Haiduc et al., 2010a, b; Eddy et al., 2013; Wong et al., 2013; Vassallo et al., 2014; Wong et al., 2015). Template-based approaches are limited to the underlying set of templates, while the similarity measures in IR-based approaches can be erroneous (Rai et al., 2022). Recently, researchers have been investigating several learning-based (e.g., deep learning) approaches for documentation generation. For example, CODE-NN, presented by Iyer et al., can generate documentation of C# and SQL code using LSTM attention networks (Iyer et al., 2016). Allamanis et al. also used an attention neural network for code summarization (Allamanis et al., 2016). Hu et al. developed DeepCom, which produces code documentation using NLP techniques and by combining the lexical and structural information of source code (Hybrid-DeepCom) (Hu et al., 2018, 2020).

The recent success of pre-trained transformer models in several domains has encouraged researchers to utilize them for automated documentation generation as well. In fact, they are found to be the state-of-the-art performers in this task. CodeBERT, a BERT-based model pre-trained on large-scale natural and programming language data, showed strong performance in automated documentation generation (Feng et al., 2020). Other transformer-based pre-trained models, e.g., PLBART (Ahmad et al., 2021) and CoTexT (Phan et al., 2021), have recently outperformed CodeBERT in the context of documentation generation.

Although pre-trained transformer models have shown promise in code documentation generation, we are aware of no study that evaluates the effectiveness of GPT-3 in this direction. GPT-3 is the third-generation Generative Pre-trained Transformer model developed by OpenAI. GPT-3 has 175B parameters and is trained on very large-scale internet data (Brown et al., 2020). It has been found to achieve high performance in different classification and generation tasks (Floridi and Chiriatti, 2020; Dale, 2021; Chiu and Alexander, 2021; Chintagunta et al., 2021). Similar to other transformer models, the GPT-3 architecture can also be used in different software engineering tasks that involve both natural and programming language understanding. In fact, OpenAI released a dedicated model, Codex, for this purpose. Codex is a GPT-3-like model trained on large-scale GitHub data covering more than a dozen programming languages such as Python, Java, PHP, and JavaScript (Chen et al., 2021). OpenAI's video demo shows that Codex can generate source code from a requirement specified in natural language. Since then, researchers have been investigating Codex for the automation of several SE tasks such as code generation (Finnie-Ansley et al., 2022), code repair (Prenner and Robbes, 2021), security bug-fixing (Pearce et al., 2021), and simulation modeling (Jackson and Saenz, 2022). The official documentation of Codex mentions that it is also capable of automatic documentation generation. However, we are not aware of any systematic evaluation of Codex for producing code documentation.

In this paper, we conducted a preliminary case study to investigate the effectiveness of Codex in code documentation generation. We analyzed the code documentation automatically produced by Codex for six programming languages: Python, Java, PHP, JavaScript, Go, and Ruby. Specifically, we evaluated Codex on the CodeSearchNet dataset (Husain et al., 2019) and compared its performance with existing models (Feng et al., 2020; Ahmad et al., 2021; Phan et al., 2021; Parvez et al., 2021). Unlike other pre-trained transformer models, GPT-3 often performs well on different downstream tasks without any re-training or fine-tuning. Instead, it is adapted to a task by one-shot (or few-shot) learning, where the model is provided with a task description and one (or a few) example(s) of that task. In fact, simply providing the task description without any examples (zero-shot learning) often yields good results. In this study, we experimented with Codex using both zero-shot and one-shot learning. We found that Codex with one-shot learning shows state-of-the-art overall performance in documentation generation, outperforming all previous models with an average BLEU score of 20.63. In terms of language-specific performance, it outperforms all the models in four out of six languages (i.e., Python, Ruby, JavaScript, Go) by a good margin. For the other two languages, Java and PHP, it is the second-best performer, slightly outperformed by REDCODER and CodeBERT, respectively. We also conducted several qualitative analyses of the Codex documentation in terms of Flesch-Kincaid Grade Level (readability), Documentation Length (quantity), and TF-IDF (informativeness). We found that the generated documentation is close to the actual documentation based on these metrics. In fact, we observed that some Codex-generated documentation contains more comprehensible information than the actual documentation. Overall, even with a very basic setup (one-shot learning), Codex is capable of state-of-the-art performance in this field. To the best of our knowledge, ours is the first systematic evaluation of the GPT-3 Codex model for automated code documentation generation.

2. Related Work

Related work on automatic code documentation can be grouped into three types: template-based, information retrieval-based, and learning-based.

Template-based approaches use an automatic tool to insert information according to pre-defined rules and layouts. Sridhara et al. utilized natural language templates to capture the key statements from a Java method and build a method-level summary (Sridhara et al., 2010). Rastkar et al. and Moreno et al. used heuristics to extract and summarize information from source code (Rastkar et al., 2011; Moreno et al., 2013). McBurney et al. merged contextual information with method statements (McBurney and McMillan, 2014). Abid et al. produced summaries for C++ methods by stereotyping them with their source code analysis framework (srcML) (Abid et al., 2015). Rai et al. summarized Java code using code-level nano-patterns (Rai et al., 2017).

Information retrieval approaches like latent semantic indexing (LSI) and vector space modeling (VSM) have been employed by Haiduc et al. to generate documentation for classes and methods (Haiduc et al., 2010a, b). Eddy et al. extended their work by exploiting a hierarchical topic model (Eddy et al., 2013). Wong et al. employed code clone detection to find similar code snippets from Stack Overflow and automatically mined source code descriptions for comment generation (Wong et al., 2013, 2015).

Learning-based approaches mostly use deep learning techniques to learn latent features from source code. Iyer et al. proposed an LSTM-based network, CODE-NN, that was trained on Stack Overflow data to generate C# and SQL code summaries (Iyer et al., 2016). Allamanis et al. used an attention neural network that employs convolution to learn local, time-invariant features (Allamanis et al., 2016). Barone et al. built a dataset of Python functions and their docstrings using GitHub data and employed Neural Machine Translation (NMT) to generate docstrings from given functions (Barone and Sennrich, 2017). Wan et al. proposed a reinforcement learning framework that incorporates the abstract syntax tree (AST) along with the sequential code content (Wan et al., 2018). Hu et al. developed DeepCom and Hybrid-DeepCom to generate code comments by learning from a large corpus and by combining lexical and structural information of code (Hu et al., 2018, 2020). Chen and Zhou designed a neural framework called BVAE to improve code retrieval and summarization tasks (Chen and Zhou, 2018). Several studies employed transformer-based models, e.g., Ahmad et al. used a self-attention based transformer model (Ahmad et al., 2020) while Wang et al. developed a BERT-based model called Fret (Wang et al., 2020). Feng et al. presented CodeBERT, pre-trained on 2.1M bimodal data points (code and descriptions) and 6.4M unimodal codes across six languages (Python, Java, JavaScript, PHP, Ruby, and Go) (Feng et al., 2020). They evaluated CodeBERT on two downstream NL-PL tasks (code search and documentation generation) by fine-tuning the model parameters. Gao et al. proposed a Code Structure Guided Transformer for source code summarization that incorporates code structural properties into the transformer to improve performance (Gao et al., 2021). Phan et al. presented CoTexT, which learns the representative context between natural and programming language and performs several downstream tasks including code summarization (Phan et al., 2021). Ahmad et al. developed PLBART, a sequence-to-sequence model pre-trained on a large set of Java and Python functions and their textual descriptions collected from GitHub and Stack Overflow (Ahmad et al., 2021). Later, Parvez et al. showed that adding relevant code/summaries retrieved from a database (such as GitHub and Stack Overflow) can improve the quality of documentation produced by a generator model (Parvez et al., 2021).

Though transformer-based models have shown promise in documentation generation, the GPT-3 model had not yet been systematically evaluated for this task despite its success and popularity. Our study employed the GPT-3 based Codex model for documentation generation and compared its performance with existing approaches.

3. Experiment

In this section, we present the results of our preliminary investigation of the effectiveness of GPT-3 for source code documentation generation. A schematic overview of the major steps of our study is shown in Figure 1 and described below.

Figure 1. A schematic overview of our study

3.1. Dataset Collection and Preprocessing

To evaluate the effectiveness of our method, we used CodeSearchNet (Husain et al., 2019), a widely used dataset for different downstream SE tasks. This dataset is also included in CodeXGLUE (Lu et al., 2021), a machine learning benchmark for code understanding and generation. It comprises a large number of code–documentation pairs from six different languages, i.e., Java, Python, PHP, Go, JavaScript, and Ruby. We applied the same data preprocessing recommended by Feng et al., who evaluated CodeBERT for documentation generation (Feng et al., 2020). We first removed comments from all the code and then removed examples where i) the code cannot be parsed into an abstract syntax tree, ii) the number of tokens in the documentation is less than 3 or greater than 256, iii) the documentation contains special tokens such as <img ...> or https://, or iv) the documentation is not written in English. The statistics of the cleaned dataset are available in Table 1.

Table 1. Statistics of CodeSearchNet (Husain et al., 2019)
Language Train Valid Test
Java 164,923 5,183 10,955
Python 251,820 13,914 14,918
PHP 241,241 12,982 14,014
GO 167,288 7,325 8,122
JavaScript 58,025 3,885 3,291
Ruby 24,927 1,400 1,261
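
For illustration, the filtering rules above can be expressed as a simple predicate over code–documentation pairs. The following Python sketch is only an approximation of the actual pipeline of Feng et al.: it assumes tokenization is done upstream and omits the AST-parse check and language detection.

import re

def keep_example(doc_tokens, doc_text):
    # Rule ii): documentation must contain between 3 and 256 tokens
    if not (3 <= len(doc_tokens) <= 256):
        return False
    # Rule iii): drop documentation containing special tokens such as <img ...> or URLs
    if re.search(r"<img|https?://", doc_text):
        return False
    return True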

3.2. GPT-3 Model and Parameter Setup

Model. Generative Pre-trained Transformer 3 (GPT-3) (Brown et al., 2020), developed by OpenAI, is an autoregressive language model that employs deep learning to generate human-like text. It has 175 billion parameters and 96 layers, trained on ~45 TB of text data from various web content (such as Wikipedia). It has shown promising results in various NLP tasks, e.g., question answering, summarization, and translation. GPT-3 has also shown great success in software engineering tasks like code generation from natural language commands. Hence, we were motivated to evaluate its effectiveness in code documentation generation as well. For our experiment, we used Codex, a descendant of GPT-3 that is trained on both natural language and billions of lines of public code from GitHub. Codex can understand various programming languages such as Python, Java, JavaScript, Go, Perl, PHP, and Ruby. Codex has already been used successfully to automate several SE tasks, e.g., code generation (Finnie-Ansley et al., 2022), code repair (Prenner and Robbes, 2021), and security bug-fixing (Pearce et al., 2021).

Prompt Engineering. Interaction with GPT-3/Codex takes place via prompt engineering, where a task description is provided as the input (prompt) and the model (GPT-3) performs the desired task (generates text) accordingly. There are several ways of prompting GPT-3 models, i.e., zero-shot, one-shot, and few-shot learning. In zero-shot learning, the model is expected to generate an answer without being given any example; no additional information other than the task description itself is given in the prompt. On the other hand, one-shot and few-shot learning involve giving one (one-shot) or more than one (few-shot) examples in the prompt, respectively. In this study, we experimented with zero-shot and one-shot learning. In zero-shot learning, we simply tell the model to generate documentation for the given source code in the prompt. In one-shot learning, we randomly select one sample (i.e., a code–documentation pair) from the train set of the corresponding language (from CodeSearchNet), provide it in the prompt as an example, and then ask the model to generate documentation for another source code by learning from the provided example. Figure 2 depicts the one-shot prompt format we used for documentation generation.

Code:
def add(x, y):
    return x+y
Documentation: Adds two numbers.
Code:
def subtract(x, y):
    return x-y
Documentation: [To be generated by Codex]
Figure 2. Sample prompt format for one-shot learning

Parameter Settings. A number of parameters are involved with GPT-3 based models. One such parameter is Temperature, which controls the randomness of the generated output (range 0 to 1). Another randomness parameter is Top-p (range 0 to 1), which controls how unlikely words get removed from the sampling pool by choosing from the smallest possible set of words whose cumulative probability exceeds p. As recommended by the official Codex documentation, we set the temperature to a low value (0.2) while keeping Top-p at its default value (1.0). We also kept the Frequency and Presence penalties at their default values (0.0); these control the level of word repetition in the generated text by penalizing words based on their existing frequency and presence. We set the maximum token size to 256 since our formatted dataset does not contain any documentation longer than 256 tokens (see Section 3.1).
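
As an illustration, a one-shot query with these settings could be issued through OpenAI's Python client roughly as sketched below. The engine name, stop sequence, and exact prompt wording are our assumptions for demonstration and may differ from the precise configuration used.

import openai  # assumes the OpenAI Python client and a valid API key

openai.api_key = "YOUR_API_KEY"  # placeholder

# One-shot prompt following the format of Figure 2
prompt = (
    "Code:\n"
    "def add(x, y):\n"
    "    return x+y\n"
    "Documentation: Adds two numbers.\n"
    "Code:\n"
    "def subtract(x, y):\n"
    "    return x-y\n"
    "Documentation:"
)

response = openai.Completion.create(
    engine="code-davinci-002",  # assumed Codex engine name
    prompt=prompt,
    temperature=0.2,            # low randomness, as recommended
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    max_tokens=256,
    stop=["Code:"],             # assumed stop sequence so generation ends after one documentation
)
documentation = response["choices"][0]["text"].strip()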

3.3. Evaluation of Generated Documentation

Since GPT-3 models are currently subject to response limits and cost, we used a statistically significant sample size for this study instead of using the whole test sets for the different languages. As depicted in Table 1, the largest test set in CodeSearchNet belongs to Python, consisting of 14,918 samples, and a statistically significant sample size from that would be 375 with a 95% confidence level and a 5% margin of error. However, we randomly selected 1,000 (>375) samples from each test set (i.e., a total of 6 sets of 1K samples for the 6 languages) and evaluated the Codex model on them. For evaluation, we used the BLEU score (Papineni et al., 2002), a popular evaluation metric for machine-generated text that calculates the n-gram similarity between a generated and a reference text. Since the generated documentation can be short at times and higher-order n-grams might not overlap, we used the smoothed BLEU score (Lin and Och, 2004), as recommended by prior works (Feng et al., 2020; Ahmad et al., 2021; Parvez et al., 2021).
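
A smoothed sentence-level BLEU of this kind can be computed, for example, with NLTK as in the minimal sketch below; the particular smoothing method shown is our choice for illustration and may differ from the exact variant used by prior works.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def smoothed_bleu(reference_doc, generated_doc):
    # Whitespace tokenization, for illustration only
    reference = [reference_doc.split()]
    hypothesis = generated_doc.split()
    smoother = SmoothingFunction().method4  # assumed smoothing variant
    return sentence_bleu(reference, hypothesis, smoothing_function=smoother)

print(smoothed_bleu("Adds two numbers.", "Add the two given numbers."))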

3.3.1. Performance Analysis

Table 2 compares Codex’s performance with several state-of-the-art models for documentation generation: Seq2Seq (Sutskever et al., 2014), Transformer (Vaswani et al., 2017), RoBERTa (Liu et al., 2019), CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021), CoTexT (Phan et al., 2021), and REDCODER (Parvez et al., 2021). We observed that although Codex with zero-shot learning could not achieve satisfactory results (mostly because it fails to learn the expected documentation format), the performance greatly improves with one-shot learning. In fact, Codex (with one-shot learning) shows the best overall performance among all approaches, with an average BLEU score of 20.63, while the nearest competitor, CoTexT, achieves 18.55 (an 11.21% improvement). In language-specific performance, it significantly outperforms the other models in all languages except two, Java and PHP. In Java and PHP, Codex achieves BLEU scores of 22.81 and 25.13, which are slightly outperformed by REDCODER (22.95) and CodeBERT (25.16), respectively. Here, REDCODER is not an individual model on its own; rather, it is a retrieval approach that can be used with other generative models to enhance their performance. In the original paper (Parvez et al., 2021), the authors used PLBART (Ahmad et al., 2021) as the base model for REDCODER, where they retrieved relevant summaries from Stack Overflow, GitHub, etc. and provided them along with the input to PLBART to enhance its performance. Hence, it incurs some additional overhead (time and resources), while Codex is a standalone model. Moreover, Codex might perform even better if combined with such an additional retrieval approach (i.e., REDCODER). On the other hand, CodeBERT and most of the other reported models had been (re)trained or fine-tuned on task- and language-specific datasets, while Codex was provided with only zero or one example in our evaluation.

Table 2. Results on documentation generation (BLEU score)
Model Ruby JavaScript GO Python Java PHP Overall
Seq2Seq (Sutskever et al., 2014) 9.64 10.21 13.98 15.93 15.09 21.08 14.32
Transformer (Vaswani et al., 2017) 11.18 11.59 16.38 15.81 16.26 22.12 15.56
RoBERTa (Liu et al., 2019) 11.17 11.90 17.72 18.14 16.47 24.02 16.57
CodeBERT (Feng et al., 2020) 12.16 14.90 18.07 19.06 17.65 25.16 17.83
PLBART (Ahmad et al., 2021) 14.11 15.56 18.91 19.30 18.45 23.58 18.32
CoTexT (2-CC) (Phan et al., 2021) 13.07 14.77 19.37 19.52 19.1 24.47 18.38
CoTexT (1-CC) (Phan et al., 2021) 14.02 14.96 18.86 19.73 19.06 24.58 18.55
REDCODER (Parvez et al., 2021) - - - 21.01 22.94 - N/A
REDCODER-EXT (Parvez et al., 2021) - - - 20.91 22.95 - N/A
Codex (0-shot) 5.41 9.83 15.80 18.93 13.59 13.32 12.81
Codex (1-shot) 16.04 16.58 20.94 22.28 22.81 25.13 20.63

3.3.2. Qualitative Analysis

As suggested by Schreck et al. (Schreck et al., 2007), we used Documentation Length and Flesch-Kincaid Grade Level (DuBay, 2004) to measure the quantity and readability of the generated documentation. We find that the average Flesch-Kincaid score of the Codex-generated documentation is 5.97 with an average length of 8 words (per documentation), while the average Flesch-Kincaid score of the actual documentation is 6.77 with an average length of 12 words. Hence, the generated documentation is close to the actual documentation in terms of quantity and readability. We further analyzed the informativeness of the generated documentation with respect to the actual documentation using TF-IDF. To calculate the TF-IDF of a particular documentation, we sum the TF-IDF scores of all the words in the documentation (excluding stop-words). The average TF-IDF of the Codex-generated documentation is 1.94, while for the actual documentation it is 2.28. Hence, the informativeness of the generated documentation is satisfactory in terms of TF-IDF relative to the actual documentation. We show some examples of Codex-generated documentation in Figure 3. We see that in some cases, Codex can generate more comprehensible or additional information compared to the actual documentation. In the PHP example in Figure 3, Codex writes “base-10 number” instead of “decimal”, which is more understandable for users. In the Ruby example, Codex provides extra information, “If the path is not present a noop is performed.”, which is correct but was not present in the actual documentation.

Java
protected ImageIcon createImageIcon(String path, String description) {
java.net.URL imgURL = getClass().getResource(path);
if (imgURL != null) {
return new ImageIcon(imgURL, description);
} else {
System.err.println("Couldn't find file: " + path);
return null;
}
}
Actual: Returns an ImageIcon, or null if the path was invalid.
Codex: Returns an ImageIcon, or null if the path was invalid.
Python
def check_for_file(self, file_path):
try:
files = self.connection.glob(file_path, details=False, invalidate_cache=True)
return len(files) == 1
except FileNotFoundError:
return False
Actual: Check if a file exists on Azure Data Lake.
Codex: Check if a file exists on the remote machine.
PHP
public static function base256ToDec($number)
{
Validate::canCastAsString($number, 'number');
$result = 0;
$base = 1;
for ($i = strlen($number) - 1; $i >= 0; $i--) {
$result = bcadd($result, bcmul(ord($number[$i]), $base));
$base = bcmul($base, 256);
}
return $result;
}
Actual: Convert base 256 number to decimal number.
Codex: Converts a base 256 number to a base 10 number.
JavaScript
function validateProtocolOptions(protocolOptions) {
if (!protocolOptions) {
throw new TypeError('protocolOptions not defined in options');
}
const version = protocolOptions.maxVersion;
if (version && (typeof version !== 'number' || !types.protocolVersion.isSupported(version))) {
throw new TypeError(util.format('protocolOptions.maxVersion provided (%s) is invalid', version));
}
}
Actual: Validates the protocol options.
Codex: Validate protocol options.
Ruby
def remove_file(*paths)
path = File.join(*paths)
log.debug(log_key) { "Removing file '#{path}'" }
FileUtils.rm_f(path)
path
end
Actual: Remove the file at the given path.
Codex: Removes the file located at the given path. If the path is not present a noop is performed.
GO
func (os *orderedSet) append(s string) {
if _, ok := os.included[s]; !ok {
os.list = append(os.list, s)
os.included[s] = struct{}{}
}
}
Actual: append adds s to the end of os, only if it is not included already.
Codex: append adds s to the orderedSet if it is not already present.
Figure 3. Examples of documentation by Codex (1-shot)
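
For illustration, the three metrics used in this qualitative analysis (documentation length, Flesch-Kincaid Grade Level, and summed TF-IDF) could be computed with off-the-shelf libraries as sketched below. The textstat and scikit-learn packages are our choice for demonstration, and the TF-IDF model would normally be fitted over the full set of documentation rather than this toy two-item corpus.

import textstat
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Check if a file exists on Azure Data Lake.",
    "Check if a file exists on the remote machine.",
]

# Quantity: documentation length in words
lengths = [len(d.split()) for d in docs]

# Readability: Flesch-Kincaid Grade Level per documentation
grades = [textstat.flesch_kincaid_grade(d) for d in docs]

# Informativeness: sum of the TF-IDF weights of each documentation's non-stop-words
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
informativeness = tfidf.sum(axis=1)  # one score per documentation

print(lengths, grades, informativeness)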

4. Conclusion and Future Work

We explored GPT-3 Codex for automatic documentation generation and compared its performance with existing approaches. While previous approaches require task- or language-specific retraining or fine-tuning, Codex shows state-of-the-art performance even with very basic settings. In the future, we intend to investigate Codex more in-depth in terms of parameter tuning, few-shot learning, and fine-tuning to improve its performance even further. We also plan to employ Codex to fix documentation issues (e.g., documentation smells) that we found in our earlier study (Khan et al., 2021). This study has a number of limitations that we want to address in the future. First, we used 1K samples per language to evaluate Codex-generated documentation. Though the sample size is statistically significant, we intend to test Codex on more samples in the future. Second, we limited our investigation to zero-shot and one-shot learning. We will extend the investigation by also analyzing few-shot learning. Third, we randomly picked one sample from the corresponding train set to use in one-shot learning. However, using different samples in one-shot learning might yield different outcomes, which we have not explored. Fourth, like other pre-trained transformer models, GPT-3 also supports fine-tuning. However, the latest versions of GPT-3 models (e.g., text-davinci, Codex) are not yet available for fine-tuning. We could fine-tune GPT-3 when it becomes available to do so. Finally, a systematic investigation of the effect of the different parameters associated with GPT-3 has not been done in this study; we picked the parameter values based on the official documentation.

Acknowledgement. This work was funded by the Natural Sciences and Engineering Research Council of Canada, the University of Calgary, Alberta Innovates, and the Alberta Graduate Excellence Scholarship.

References

  • Abid et al. (2015) Nahla J Abid, Natalia Dragan, Michael L Collard, and Jonathan I Maletic. 2015. Using stereotypes in the automatic generation of natural language summaries for c++ methods. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 561–565.
  • Aghajani et al. (2020) Emad Aghajani, Csaba Nagy, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, Michele Lanza, and David C. Shepherd. 2020. Software Documentation: The Practitioners’ Perspective. In 42nd International Conference on Software Engineering. 12.
  • Aghajani et al. (2019) Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Márquez, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In 41st International Conference on Software Engineering. 1199–1210.
  • Ahmad et al. (2020) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653 (2020).
  • Ahmad et al. (2021) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
  • Allamanis et al. (2016) Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International conference on machine learning. PMLR, 2091–2100.
  • Barone and Sennrich (2017) Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275 (2017).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Chen and Zhou (2018) Qingying Chen and Minghui Zhou. 2018. A neural framework for retrieval and summarization of source code. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 826–831.
  • Chintagunta et al. (2021) Bharath Chintagunta, Namit Katariya, Xavier Amatriain, and Anitha Kannan. 2021. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Machine Learning for Healthcare Conference. PMLR, 354–372.
  • Chiu and Alexander (2021) Ke-Li Chiu and Rohan Alexander. 2021. Detecting hate speech with gpt-3. arXiv preprint arXiv:2103.12407 (2021).
  • Dale (2021) Robert Dale. 2021. GPT-3: What’s it good for? Natural Language Engineering 27, 1 (2021), 113–118.
  • de Souza et al. (2005) Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. 2005. A study of the documentation essential to software maintenance. In 23rd annual international conference on Design of communication: documenting & designing for pervasive information. 68–75.
  • DuBay (2004) William H DuBay. 2004. The principles of readability. Online Submission (2004).
  • Eddy et al. (2013) Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft, and Jeffrey C Carver. 2013. Evaluating source code summarization techniques: Replication and expansion. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 13–22.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  • Finnie-Ansley et al. (2022) James Finnie-Ansley, Paul Denny, Brett A Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Australasian Computing Education Conference. 10–19.
  • Floridi and Chiriatti (2020) Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30, 4 (2020), 681–694.
  • Gao et al. (2021) Shuzheng Gao, Cuiyun Gao, Yulan He, Jichuan Zeng, Lun Yiu Nie, and Xin Xia. 2021. Code structure guided transformer for source code summarization. arXiv preprint arXiv:2104.09340 (2021).
  • Haiduc et al. (2010a) Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010a. Supporting program comprehension with source code summarization. In 2010 acm/ieee 32nd international conference on software engineering, Vol. 2. IEEE, 223–226.
  • Haiduc et al. (2010b) Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010b. On the use of automated text summarization techniques for summarizing source code. In 2010 17th Working Conference on Reverse Engineering. IEEE, 35–44.
  • Hu et al. (2018) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC). IEEE, 200–20010.
  • Hu et al. (2020) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2020. Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25, 3 (2020), 2179–2217.
  • Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
  • Ibrahim et al. (2012) Walid M Ibrahim, Nicolas Bettenburg, Bram Adams, and Ahmed E Hassan. 2012. On the relationship between comment update practices and software bugs. Journal of Systems and Software 85, 10 (2012), 2293–2304.
  • Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
  • Jackson and Saenz (2022) Ilya Jackson and Maria Jesus Saenz. 2022. From Natural Language to Simulations: Applying GPT-3 Codex to Automate Simulation Modeling of Logistics Systems. arXiv preprint arXiv:2202.12107 (2022).
  • Khan et al. (2021) Junaed Younus Khan, Md Tawkat Islam Khondaker, Gias Uddin, and Anindya Iqbal. 2021. Automatic detection of five api documentation smells: Practitioners’ perspectives. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 318–329.
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Orange: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. 501–507.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
  • McBurney et al. (2017) Paul W McBurney, Siyuan Jiang, Marouane Kessentini, Nicholas A Kraft, Ameer Armaly, Mohamed Wiem Mkaouer, and Collin McMillan. 2017. Towards prioritizing documentation effort. IEEE Transactions on Software Engineering 44, 9 (2017), 897–913.
  • McBurney and McMillan (2014) Paul W McBurney and Collin McMillan. 2014. Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension. 279–290.
  • Moreno et al. (2013) Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
  • Parnas (2011) David Lorge Parnas. 2011. Precise documentation: The key to better software. In The Future of Software Engineering. Springer, 125–148.
  • Parvez et al. (2021) Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Summarization. arXiv preprint arXiv:2108.11601 (2021).
  • Pearce et al. (2021) Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2021. Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs? arXiv preprint arXiv:2112.02125 (2021).
  • Phan et al. (2021) Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. 2021. Cotext: Multi-task learning with code-text transformer. arXiv preprint arXiv:2105.08645 (2021).
  • Prenner and Robbes (2021) Julian Aron Prenner and Romain Robbes. 2021. Automatic Program Repair with OpenAI’s Codex: Evaluating QuixBugs. arXiv preprint arXiv:2111.03922 (2021).
  • Rai et al. (2022) Sawan Rai, Ramesh Chandra Belwal, and Atul Gupta. 2022. A Review on Source Code Documentation. ACM Transactions on Intelligent Systems and Technology (TIST) (2022).
  • Rai et al. (2017) Sawan Rai, Tejaswini Gaikwad, Sparshi Jain, and Atul Gupta. 2017. Method level text summarization for java code using nano-patterns. In 2017 24th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 199–208.
  • Rastkar et al. (2011) Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In 2011 27th IEEE International Conference on Software Maintenance (ICSM). IEEE, 103–112.
  • Robillard (2009) Martin P. Robillard. 2009. What Makes APIs Hard to Learn? Answers from Developers. IEEE Software 26, 6 (2009), 26–34.
  • Schreck et al. (2007) Daniel Schreck, Valentin Dallmeier, and Thomas Zimmermann. 2007. How documentation evolves over time. In Ninth international workshop on Principles of software evolution: in conjunction with the 6th ESEC/FSE joint meeting. 4–10.
  • Sridhara et al. (2010) Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. 43–52.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
  • Uddin et al. (2022) Gias Uddin, Omar Alam, and Alexander Serebrenik. 2022. A qualitative study of developers’ discussions of their problems and joys during the early COVID-19 months. Empirical Software Engineering 27, 5 (2022), 1–52.
  • Uddin et al. (2012) Gias Uddin, Barthélémy Dagenais, and Martin P. Robillard. 2012. Temporal Analysis of API Usage Concepts. In Proc. 34th IEEE/ACM Intl. Conf. on Software Engineering. 804–814.
  • Uddin et al. (2021) Gias Uddin, Foutse Khomh, and Chanchal K Roy. 2021. Automatic API Usage Scenario Documentation from Technical Q&A Sites. ACM Transactions on Software Engineering and Methodology 30, 3 (2021), 1–45.
  • Uddin and Robillard (2015) Gias Uddin and Martin P. Robillard. 2015. How API Documentation Fails. IEEE Software 32, 4 (2015), 76–83.
  • Vassallo et al. (2014) Carmine Vassallo, Sebastiano Panichella, Massimiliano Di Penta, and Gerardo Canfora. 2014. Codes: Mining source code descriptions from developers discussions. In Proceedings of the 22nd International Conference on Program Comprehension. 106–109.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wan et al. (2018) Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
  • Wang et al. (2020) Ruyun Wang, Hanwen Zhang, Guoliang Lu, Lei Lyu, and Chen Lyu. 2020. Fret: Functional reinforced transformer with bert for code summarization. IEEE Access 8 (2020), 135591–135604.
  • Wong et al. (2015) Edmund Wong, Taiyue Liu, and Lin Tan. 2015. Clocom: Mining existing source code for automatic comment generation. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 380–389.
  • Wong et al. (2013) Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. Autocomment: Mining question and answer sites for automatic comment generation. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 562–567.
  • Xia et al. (2017) Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering 44, 10 (2017), 951–976.