WhyGen: Explaining ML-powered Code Generation by Referring to Training Examples
Abstract.
Deep learning has demonstrated great abilities in various code generation tasks. However, despite the great convenience for some developers, many are concerned that the code generators may recite or closely mimic copyrighted training data without user awareness, leading to legal and ethical concerns. To ease this problem, we introduce a tool, named WhyGen, that explains the generated code by referring to training examples. Specifically, we first introduce a data structure, named inference fingerprint, to represent the decision process of the model when generating a prediction. The fingerprints of all training examples are collected offline and saved to a database. When the model is used at runtime for code generation, the most relevant training examples can be retrieved by querying the fingerprint database. Our experiments show that WhyGen is able to precisely notify users about possible recitations and highly similar imitations with a top-10 accuracy of 81.21%. The demo video can be found at https://youtu.be/EtoQP6850To.
1. Introduction
Deep learning has recently been applied to various code generation tasks and has shown remarkable progress (Lu et al., 2021; Wang et al., 2021). For instance, GitHub Copilot (Chen et al., 2021), a giant deep neural network developed by OpenAI, is able to generate highly-usable code from simple docstring or code prompts. Such code generators can greatly improve the efficiency of developers by letting them focus on the high-level design rather than on the implementation details.
However, many developers are worried about the use of copyrighted source code for training such ML-powered code generators. The machine learning models may have memorized the training data and generate code that is verbatim or very similar to the training examples. Consequently, it may lead to licensing infringement if it generates and injects copyrighted code into customers’ software.
Although there has already been a lot of debates on this issue from the legal perspectives (cop, 2021; Franceschelli and Musolesi, 2021; Pearce et al., 2021), how to technically ease this tension is still an open problem. There is an inevitable trade-off between achieving higher accuracy and reducing training data memorization. The success of today’s DNN-powered code generators is largely due to their remarkable accuracy, and thus sacrificing the accuracy for less ethical concern may not be a sustainable solution.
We argue that a better way out is to keep the accurate training as it is, while additionally referring to the relevant training examples upon code generation. On the one hand, the users of the code generators can understand why a certain code snippet is generated and learn more details from the referred examples (including the license and detailed usage). On the other hand, the code generators do not need to sacrifice accuracy by reducing training data or memorization. Achieving this goal is challenging since DNN models are usually regarded as black boxes that are very difficult to interpret.
To this end, we introduce WhyGen, a tool that explains the predictions of ML-powered code generators by examples. WhyGen solves the aforementioned problem by introducing a novel data structure, named inference fingerprint, to represent the decision process of a model. An inference fingerprint is a vector of activation values produced by a set of critical intermediate neurons in the network during the inference pass. The fingerprint vectors can be compared across different inference passes, where similar samples yield similar fingerprints. Therefore, when the model is used online for code generation, we can compare the generated fingerprint with the fingerprints produced by the training examples, and retrieve the most relevant training examples to explain the generation.
We implement WhyGen on a popular open-source DNN-based code generator named CodeGPT (Lu et al., 2021) and test it on the PY150 dataset (Raychev et al., 2016). We randomly select 10,000 test examples that recite training data (i.e., generate code snippets that are the same as or very similar to uncommon training examples), and check whether WhyGen can find the recited examples at inference time. We find that WhyGen can precisely locate the related training examples with a top-10 accuracy of 81.21%. Meanwhile, the latency of retrieving related training examples is around 6 ms, which is negligible compared to the code generation itself.
The rest of this paper is organized as follows. Section 2 introduces the design of the tool. Section 3 presents the accuracy and performance of the tool based on experiments. Section 4 and Section 5 introduce the related work and future work, respectively.

2. Tool Design
The workflow of WhyGen is shown in Figure 1. For each query code given by the user (a programmer who is using the ML-powered code generator), we extract an inference fingerprint from the neural network. The fingerprint is used to query a fingerprint dataset to find the most similar fingerprints and their corresponding training examples. The retrieved training examples are then returned to the user with the code generated by the model, giving them prompts about which training examples are potentially relevant to the current generation. We also provide the source (e.g., the link to the original GitHub repository) of each relevant training example to the user for further reference.
2.1. Inference Fingerprint
Understanding which training samples are more relevant to a certain generation is challenging, because neural networks are usually regarded as black boxes that are difficult to interpret. The training examples are used to compute gradients that accumulate into millions of model weights. It is hard to distinguish the contribution of each training example after the model parameters are learned.
Instead of analyzing which training examples contribute the most to the code generation, we analyze which training examples trigger decision logic similar to that of the user query. We assume the training examples with similar decision logic are the relevant examples for the generated code. This assumption, though not formally provable, is intuitive, because human brains also process related concepts with similar decision patterns.
We introduce a data structure, named inference fingerprint, to represent the decision logic of the neural network and compare across different data examples. An inference fingerprint is a vector of activation values produced by a set of intermediate neurons in the network during the inference pass. The same set of intermediate neurons is used to produce the fingerprints, and thus the fingerprints are comparable across different data examples. Prior work has attempted to use intermediate neurons to represent the decision logic of DNN (Zhang et al., 2020; Liu et al., 2020), but they are mainly designed for other purposes (such as adversarial detection, data distribution estimation, etc.) and the computation of critical neurons is relatively slow.
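The following is a minimal sketch of what fingerprint extraction could look like, not the exact WhyGen implementation: it uses a stock GPT-2 checkpoint as a stand-in for CodeGPT, hooks every Transformer block as a placeholder layer choice, and uses made-up `(layer, neuron index)` pairs where the real system would use the neurons chosen by the profiling step described below.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in for the actual CodeGPT checkpoint.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # The block output is a tuple; output[0] has shape
        # (batch, seq_len, hidden). Keep the hidden vector at the last
        # input position, i.e., where the first token is generated.
        hidden = output[0] if isinstance(output, tuple) else output
        captured[name] = hidden[:, -1, :].detach()
    return hook

# Placeholder layer choice: hook every Transformer block.
for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(make_hook(f"block_{i}"))

# Hypothetical neuron selection: (layer name, neuron index) pairs that would
# come from the offline profiling step; here just placeholders.
selected_neurons = [("block_11", j) for j in range(100)]

def inference_fingerprint(code_prompt: str) -> torch.Tensor:
    captured.clear()
    ids = tokenizer(code_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)
    values = [captured[layer][0, idx] for layer, idx in selected_neurons]
    return torch.stack(values)  # one comparable vector per inference pass

print(inference_fingerprint("def add(a, b):").shape)
```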
In our work, the selection of the intermediate neurons for producing fingerprints must meet two objectives. First, the number of selected intermediate neurons must be small, since the total number of neurons in a code generator model is far too large to record and store in full. Second, the selected intermediate neurons should be representative, so that relevant code examples can be grouped together.

Modern code generators are mostly based on the Transformer architecture (Lu et al., 2021; Wang et al., 2021; Ahmad et al., 2021). A typical inference step of a Transformer-based code generator is illustrated in Figure 2, in which the input is a sequence of preceding code tokens and the output is the predicted next token. Each piece of generated code is produced token by token, where each token is predicted by one inference step. The predicted token is appended to the query sequence and used as the input to predict the subsequent token in the next step.
Taking CodeGPT (Lu et al., 2021) as an example, it takes a sequence of tokens as the input and predicts the next token step by step until the <end> identifier is predicted. In each step of next-token prediction, CodeGPT uses the beam search algorithm to retain the top-k candidate tokens with the highest scores. For each of these top-k candidates, it then runs another inference pass and again finds the top-k highest-scoring tokens, resulting in k×k candidate combinations. Among them, only the top-k candidate combinations with the highest scores are kept for the next step, and the process repeats until the end of decoding. Finally, the candidate token combination with the highest score is returned as the final prediction.
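For illustration only, the decoding procedure described above corresponds to standard beam search; a minimal sketch using the HuggingFace generate API (again with a stock GPT-2 checkpoint rather than the actual CodeGPT weights, and an arbitrary beam width) is:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in checkpoint; the real system uses CodeGPT fine-tuned on PY150.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

prompt = "def read_json(path):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Keep the k highest-scoring candidate sequences at each step
    # (k = num_beams) and return the best one at the end.
    output_ids = model.generate(
        input_ids,
        num_beams=5,
        max_new_tokens=32,
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```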
We combine a heuristic understanding of the model with quantitative methods to locate the intermediate neurons. We first narrow down the selection to the activation layers after each encoder module, because they summarize the result of each encoding stage. Moreover, we focus on the activation values at the position of the first generated token, since at that position the model has encoded all of the user-provided input tokens and the values are most explicitly related to the generated code.
To further locate the neurons that best represent the decision process, we use a profiling phase to understand the behavior of the neurons in the activation layers. The training samples are fed into the model and the neuron output values are recorded. We compute several statistics based on the profiling results and compare several criteria for selecting the critical neurons. We find that the highest-variance neurons are the most representative, and their output values are concatenated together to form the inference fingerprint.
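A minimal sketch of this variance-based selection is shown below; the array shapes and the random placeholder data are illustrative assumptions, while the fingerprint size of 100 matches the setting in Section 2.3.

```python
import numpy as np

# Hypothetical profiling result: recorded outputs of all candidate
# intermediate neurons over the profiled training samples, with shape
# (num_samples, num_candidate_neurons). Random data as a placeholder.
rng = np.random.default_rng(0)
activations = rng.normal(size=(5000, 9216))

# Per-neuron variance across the profiled training samples.
variances = activations.var(axis=0)

# Keep the 100 highest-variance neurons; their indices define which
# activation values are concatenated into the inference fingerprint.
num_fingerprint_neurons = 100
selected = np.argsort(variances)[-num_fingerprint_neurons:]

def fingerprint(sample_activations: np.ndarray) -> np.ndarray:
    # sample_activations: candidate-neuron outputs for one inference pass.
    return sample_activations[selected]

print(selected.shape, fingerprint(activations[0]).shape)
```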
2.2. Training Data Indexing and Retrieval
Next, we compute the inference fingerprints for all training examples and save them to a database. The fingerprint generation process for the training examples is identical to the process for user input (described in Section 2.1), so that the fingerprints of training examples can be compared with, and searched against, the fingerprint generated for the user input at test time. Each record in the database includes the inference fingerprint, the code snippet, and the original source (e.g., repository URL and/or file path) of the code. The fingerprint vectors are indexed to speed up the search for the most relevant training examples.
When the code generator produces a prediction, we compute the inference fingerprint for the prediction, and find the most similar fingerprints in the database. The similarity is measured as the Euclidean distance between the two vectors. The training examples corresponding to the most similar inference fingerprints are returned to the user as the relevant training examples.
2.3. Implementation Details
We implement the prototype of WhyGen on an open-source DNN-powered code generator, CodeGPT (Lu et al., 2021), which is based on the GPT-2 language model (Radford et al., [n. d.]) and fine-tuned on the PY150 dataset (Raychev et al., 2016). The state-of-the-art closed-source code generator, Codex (the model behind Copilot) (Chen et al., 2021), is based on the GPT-3 architecture. While larger in size, GPT-3 is conceptually and structurally similar to GPT-2, so we believe our method can be applied to it as well.
To index and search for the fingerprints, we use the Faiss open-source library (Johnson et al., 2017). The size of the inference fingerprint is set to 100 in our implementation, and the number of returned relevant training examples is set to 10 by default.
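As a rough sketch of the indexing and retrieval path, the snippet below builds an exact L2 index with Faiss and queries it for the 10 nearest fingerprints; the exact-search `IndexFlatL2` and the random placeholder fingerprints and metadata are assumptions for illustration, not necessarily the index type used in the actual implementation.

```python
import numpy as np
import faiss

d = 100      # fingerprint dimensionality (Section 2.3)
top_k = 10   # number of relevant training examples to return

# Placeholder fingerprints; in WhyGen these come from running the
# fingerprint extraction over the entire training set offline.
rng = np.random.default_rng(0)
train_fingerprints = rng.normal(size=(100_000, d)).astype("float32")
# Metadata aligned with the fingerprints: code snippet and source path.
train_metadata = [
    {"code": f"snippet_{i}", "source": f"repo/file_{i}.py"}
    for i in range(100_000)
]

# Exact Euclidean-distance index over the fingerprint vectors.
index = faiss.IndexFlatL2(d)
index.add(train_fingerprints)

def retrieve(query_fingerprint: np.ndarray):
    query = query_fingerprint.astype("float32").reshape(1, d)
    distances, ids = index.search(query, top_k)
    return [train_metadata[i] for i in ids[0]]

print(retrieve(train_fingerprints[42])[0])
```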
3. Evaluation
We conduct experiments to evaluate WhyGen in terms of effectiveness (whether it can generate meaningful relevant training examples) and overhead (how much time it needs to retrieve the relevant examples).
3.1. Experiment Setup
Since the relevance of training examples is a subjective concept, it is difficult to evaluate directly. We therefore take an indirect approach: we first find reciting behaviors of the code generator (i.e., cases where the generator produces code that appears verbatim in the training set). These recitations are regarded as the ground truth of relevant examples, so the effectiveness of WhyGen can be evaluated by checking whether the recited code snippets appear in the results it returns.
To find the recitations, we randomly pick 10,000 code snippets from the test set and use the code generator to predict the next line for each snippet. For each predicted line of code, we search the training dataset for the most similar line, i.e., the line with the shortest edit distance to the predicted line. If the edit distance is 0 and the code line is unique enough (it occurs fewer than 10 times in the training set), we consider it a recitation. In the end, we obtain 3,842 cases of recitations. We evaluate WhyGen with the top-k accuracy metric, i.e., the probability that the recited training example is among the top k examples returned by WhyGen.
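Since an edit distance of 0 is simply a verbatim match, a simplified sketch of this recitation check can be written with exact line matching plus an occurrence threshold; the function names and the toy data below are illustrative.

```python
from collections import Counter

# A predicted line counts as a recitation if it appears verbatim in the
# training set and occurs fewer than MAX_OCCURRENCES times there.
MAX_OCCURRENCES = 10

def build_line_counts(training_files):
    counts = Counter()
    for source in training_files:
        for line in source.splitlines():
            line = line.strip()
            if line:
                counts[line] += 1
    return counts

def is_recitation(predicted_line: str, line_counts: Counter) -> bool:
    occurrences = line_counts.get(predicted_line.strip(), 0)
    return 0 < occurrences < MAX_OCCURRENCES

# Toy usage with placeholder training files.
training_files = ["x = load_config(path)\nreturn x\n", "x = load_config(path)\n"]
counts = build_line_counts(training_files)
print(is_recitation("x = load_config(path)", counts))  # True (occurs twice)
```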
3.2. Effectiveness of WhyGen
Table 1. Top-k accuracy (%) of WhyGen and its variants.

| Method | Acc@10 | Acc@5 | Acc@1 |
|---|---|---|---|
| WhyGen | 81.21 | 79.28 | 73.84 |
| Random | 67.57 | 66.61 | 62.78 |
| Maximum | 56.32 | 54.89 | 51.09 |
| Minimum | 57.26 | 55.62 | 52.32 |
| FFN | 79.46 | 77.98 | 73.43 |
Based on the found recitations, we evaluate the effectiveness of WhyGen. Due to the lack of baselines in this area, we compare the default configuration of WhyGen with several variants, each of which uses a different strategy to select the critical neurons for computing the inference fingerprint: "Random" randomly selects the intermediate neurons, "Maximum" and "Minimum" select the neurons with the largest or smallest output values, and "FFN" selects high-variance neurons from the feed-forward network layers rather than the self-attention layers.
The accuracy results are shown in Table 1. Clearly, our default configuration of WhyGen achieves the best results with a top-10 accuracy of 81.21% and top-1 accuracy of 73.84%, which is significantly better than using other criteria to select the fingerprint neurons. Selecting critical neurons from the FFN layer can achieve competitive results, but it is still slightly less effective than using the self-attention layers.
These results imply that the inference fingerprint computed by WhyGen does a good job of encoding important information about the decision-making process during code generation, and that it can effectively be used to find the training samples that share similar decision logic with the query sample.
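For reference, the top-k accuracy reported above can be computed with a simple helper like the following; the hypothetical inputs are, for each query, the list of retrieved training-example IDs and the ID of the recited (ground-truth) example.

```python
def top_k_accuracy(retrieved_lists, ground_truths, k):
    # Fraction of queries whose recited training example appears among
    # the top-k examples returned by the retrieval step.
    hits = sum(
        1 for retrieved, truth in zip(retrieved_lists, ground_truths)
        if truth in retrieved[:k]
    )
    return hits / len(ground_truths)

# Toy usage with placeholder training-example IDs.
retrieved = [[3, 7, 9], [1, 2, 5], [8, 4, 6]]
truths = [7, 5, 0]
print(top_k_accuracy(retrieved, truths, k=3))  # 0.666...
```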

Figure 3 shows an example of the relevant training examples returned by WhyGen when generating the next line for a given query code. Almost all of the five returned training examples are highly relevant to the query code and the generated code, and the first example is an exact recitation. In practice, the returned training examples can serve as a reminder or guidance for the user: if the generated code recites or closely imitates copyrighted code, the user can modify or abandon it to avoid legal and ethical concerns. WhyGen also provides the source path of each returned training example, so that users can learn more about the code predicted by the generator and decide whether to use it in their own software.
3.3. Overhead of WhyGen
We further measure the overhead of WhyGen in training and serving scenarios using a Linux server with an AMD EPYC 7742 CPU.
In the training stage, WhyGen needs to compute the fingerprints for all training examples and build an index for the fingerprints. The whole process takes around 20 hours, which is shorter than the training time of code generator models (around 25 hours). We believe the training overhead is acceptable since it is a one-time offline cost.
In the serving stage, WhyGen needs to compute the inference fingerprint and retrieve relevant examples for each prediction made by the code generator. The overhead is around 6 ms, which is minimal as compared to the code generation process (360 ms). Thus, we believe our tool can be used in real-time to give meaningful prompts to the code generator users.
4. Related Work
Instance-based Model Interpretation. Interpreting deep neural networks with training examples has become one of the major approaches to model interpretation. The most representative instance-based technique is the influence function approach (Koh and Liang, 2017), which traces a model's predictions through its learning algorithm back to the training data. However, computing influence functions is very computationally intensive, making it difficult or even impossible to apply to large language models and datasets.
Privacy leakage in language models. The training-example recitation problem in code generators is similar to the privacy leakage problem in language models, which has been discussed intensively in prior work (Pan et al., 2020; Inan et al., 2021; Carlini et al., 2020). A common way to reduce such privacy concerns is differential privacy (Abadi et al., 2016), i.e., adding noise during training to avoid memorizing individual details. However, applying differential privacy may significantly harm model accuracy, especially for large language models (Bagdasaryan et al., 2019).
5. Conclusion and Future Work
We introduce a tool to explain the code generated by DNN models by referring to training examples. The tool can possibly be used as an IDE plugin along with the auto-completion feature. We hope our technique can help reduce the concern about using unauthorized source code for training code generators.
As future work, we plan to improve the accuracy of retrieving relevant training examples by exploring better inference fingerprints. We also plan to extend WhyGen to support more and larger code generators based on the Transformer architecture, as well as other architectures such as CNNs and RNNs, to ensure good generalizability and practicality. A larger and more standard benchmark would be useful for comparing different training-example retrieval methods. Moreover, it would be interesting and helpful to investigate better quantitative metrics for measuring the causal relationship between training examples and generated code, which could be used to evaluate WhyGen and other explain-by-example techniques more comprehensively and rigorously.
Our tool is open-sourced at https://github.com/WeixiangYAN/WhyGen.
References
- cop (2021) 2021. GitHub, Copilot and the Copyright Around AI. https://www.plagiarismtoday.com/2021/07/08/github-copilot-and-the-copyright-around-ai/.
- Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318.
- Ahmad et al. (2021) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. arXiv:2103.06333 [cs.CL]
- Bagdasaryan et al. (2019) Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems 32 (2019), 15479–15488.
- Carlini et al. (2020) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2020. Extracting training data from large language models. arXiv preprint arXiv:2012.07805 (2020).
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- Franceschelli and Musolesi (2021) Giorgio Franceschelli and Mirco Musolesi. 2021. Copyright in Generative Deep Learning. arXiv:2105.09266 [cs.CY]
- Inan et al. (2021) Huseyin A Inan, Osman Ramadan, Lukas Wutschitz, Daniel Jones, Victor Rühle, James Withers, and Robert Sim. 2021. Training data leakage analysis in language models. arXiv preprint arXiv:2101.05405 (2021).
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
- Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International Conference on Machine Learning. PMLR, 1885–1894.
- Liu et al. (2020) Bingyan Liu, Yuanchun Li, Yunxin Liu, Yao Guo, and Xiangqun Chen. 2020. Pmc: A privacy-preserving deep learning model customization framework for edge computing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 4 (2020), 1–25.
- Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv preprint arXiv:2102.04664 (2021).
- Pan et al. (2020) Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. 2020. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1314–1331.
- Pearce et al. (2021) Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2021. An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions. arXiv:2108.09293 [cs.CR]
- Radford et al. ([n. d.]) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. [n. d.]. Language models are unsupervised multitask learners. ([n. d.]).
- Raychev et al. (2016) Veselin Raychev, Pavol Bielik, and Martin T. Vechev. 2016. Probabilistic model for code with decision trees. Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (2016).
- Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. arXiv preprint arXiv:2109.00859 (2021).
- Zhang et al. (2020) Ziqi Zhang, Yuanchun Li, Yao Guo, Xiangqun Chen, and Yunxin Liu. 2020. Dynamic slicing for deep neural networks. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 838–850.