
Non-autoregressive Model for Full-line Code Completion

Fang Liu1,2    Zhiyi Fu1,2    Ge Li1,2∗    Zhi Jin1,2∗    Hui Liu3    Yiyang Hao4    (∗ Corresponding authors)
1Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education
2 School of Computer Science, Peking University, Beijing, China
3Beijing Institute of Technology
4Silicon Heart Tech Co., Ltd
{liufang816, ypfzy, lige, zhijin}@pku.edu.cn, [email protected], [email protected]
Abstract

Code completion tools are frequently used by software developers to accelerate software development by suggesting the following code elements. Completing a sequence of code tokens (e.g., a full line of code) has been shown to be more efficient than predicting a single token at a time. To complete a code sequence, researchers typically employ AutoRegressive (AR) decoders that generate tokens in a left-to-right, token-by-token fashion. Consequently, the prediction of each token depends on all previously generated tokens, which leads to high inference latency. To improve the efficiency and accuracy of full-line code completion, in this paper, we propose a Non-AutoRegressive (NAR) model for code completion boosted by a syntax-aware sampling strategy. Our experimental results on two widely used datasets suggest that our model outperforms both AR and NAR baselines on full-line code completion, and it is faster than the AR model with up to a 9× speed-up.

1 Introduction

Code completion is one of the most useful features in Integrated Development Environments (IDEs), improving software development efficiency by suggesting future code snippets. In recent years, with the development of machine learning and deep learning technologies and the availability of large open-source codebases, researchers have started to tackle code completion by learning from large-scale code corpora. Various Language Models (LMs) have been employed for code completion, including N-gram (Hindle et al., 2016), RNN (Li et al., 2018), and Transformer-based models (Liu et al., 2020). Several popular industry code completion tools, e.g., Tabnine (2018), aiXcoder (2018), and Copilot (2021), also rely on LMs to provide suggestions for whole lines or entire functions in a token-by-token manner. Recently, Wang et al. (2020) further explored full-line code completion: given a partially completed code snippet, they built a Transformer-based LM to predict the next line of code. Although such code completion algorithms/tools can suggest longer code snippets, they intrinsically suffer from inefficiency because of the autoregressive decoding process, where each token is generated conditioned on all the previously generated tokens. Consequently, the generation is not parallelizable and thus particularly slow. However, the efficiency of the generation process is of great importance for code completion because the tool is expected to respond instantly on the developer's own devices.

Tokens within a code line have the potential to be predicted concurrently, although existing code completion algorithms predict them one by one (from left to right) by exploiting all previously predicted tokens. For example, multiple arguments of the same method are independent of each other, and thus they could be predicted independently and concurrently. In Python, the order (position) of arguments can even be changed when a function is called with keyword arguments. For example, Func(arg1=a, arg2=b) is equivalent to Func(arg2=b, arg1=a). We also argue that even if there exists a strong dependency among tokens within a code line, the dependency is not necessarily left-to-right. A typical example is the "one-line IF-ELSE statement" in Python, formed as: value_1 if condition else value_2. Here value_1 depends on the following condition (on its right-hand side), rather than the reverse. We also perform an in-depth analysis in section 2 to verify that the dependency among code tokens is weaker than that among tokens in natural languages, where parallelized generation has been employed successfully. Our analysis in section 2 also suggests that predicting code tokens from right-to-left or from left-to-right results in comparable performance. Based on these observations, we conclude that generating tokens in parallel for code completion is possible and reasonable.
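As a minimal, self-contained illustration of these two observations (the function and variable names are ours, chosen only for this example):

    # Keyword arguments: the two calls are equivalent, so the argument tokens
    # could, in principle, be generated independently and in parallel.
    def func(arg1, arg2):
        return (arg1, arg2)

    a, b = 1, 2
    assert func(arg1=a, arg2=b) == func(arg2=b, arg1=a)

    # One-line IF-ELSE: value_1 depends on the condition to its *right*,
    # so the natural dependency is not left-to-right.
    condition = a < b
    value_1, value_2 = "then-branch", "else-branch"
    result = value_1 if condition else value_2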

The Non-AutoRegressive (NAR) model, proposed by Gu et al. (2018), has been successfully applied to Neural Machine Translation (NMT), generating tokens non-autoregressively through a parallel decoding process:

\mathcal{L}_{NAR}=\sum_{t=1}^{T}\log p(y_{t}|X;\theta)    (1)

It allows an order of magnitude lower latency during inference. There are two kinds of NAR models: Fully NAR models (Gu et al., 2018; Ma et al., 2019) and iterative NAR models (Ghazvininejad et al., 2019; Kasai et al., 2020). These NAR models can speed up the inference process of NMT significantly compared with autoregressive models. However, NAR models are hard to train without specially designed training strategies, as the left-to-right inductive bias among target tokens is ignored during decoding.

Considering the feasibility of parallel code generation and inspired by the success of Non-AutoRegressive generation in NLP, in this paper we propose a Syntax-Aware Non-AutoRegressive model (SANAR) for code completion. Specifically, we propose an adaptive and syntax-aware sampling strategy to boost the training process of SANAR, which dynamically glances some code tokens from the target sequence according to their difficulty and token types. The better the model is trained, the fewer tokens are glanced. Once the model is well trained, it can generate the whole line of tokens in a single pass. Our experimental results show that our sampling strategy yields clear improvements on both Python and Java code completion. The proposed approach is significantly faster than the widely used AutoRegressive models, achieving a 5∼6× speed-up on average, and 9× for long targets.

To summarize, the major contributions of our work are as follows:

  • An empirical study whose results suggest that it is potentially practical to predict code tokens in parallel.

  • A novel approach for full-line code completion. To the best of our knowledge, it is the first non-autoregressive approach to code completion. We boost the approach with an adaptive and syntax-aware sampling strategy that is specially designed for source code.

  • A large-scale evaluation of the proposed approach whose results suggest that our approach outperforms both AR and NAR baselines on full-line code completion, and significantly reduces the inference time compared with the AR model.

2 Dependency among Generated Code Tokens

We argue that the dependency among the generated tokens in full-line completion is not as strong as in NMT (where NAR has been successfully applied), and the left-to-right generation order is not always optimal for code completion. To verify this assumption, we conduct experiments to analyze the dependency among target tokens in the full-line code completion task in this section.

Table 1: The percentage of code sequences that can be correctly handled by different models. EM stands for exact match, and ES stands for edit similarity.
Language  Model                 Metric  Percentage
PY        only Transformer-L2R  EM      2.65%
          only Transformer-L2R  ES>0.5  6.57%
          only Transformer-R2L  EM      2.47%
          only Transformer-R2L  ES>0.5  7.21%
          Both                  EM      13.59%
          Both                  ES>0.5  60.39%
JAVA      only Transformer-L2R  EM      3.28%
          only Transformer-L2R  ES>0.5  6.71%
          only Transformer-R2L  EM      3.86%
          only Transformer-R2L  ES>0.5  8.22%
          Both                  EM      26.15%
          Both                  ES>0.5  60.43%

2.1 Reversing the Order of Code Generation

To verify our conjecture that left-to-right is not always the optimal order for code completion, we conduct an experiment to analyze the impact of generation order. We employ the standard auto-regressive Transformer architecture to perform full-line code completion in two opposite orders: left-to-right (Transformer-L2R) and right-to-left (Transformer-R2L). Table 1 presents the results. For most cases, the performance of the L2R and R2L models is comparable. For Python, 2.65% and 2.47% of programs can only be correctly generated by L2R and R2L, respectively; 6.57% and 7.21% of the programs can only be approximately generated (edit similarity >0.5) by L2R and R2L, respectively. For Java programs, we obtain similar results. Therefore, we conclude that left-to-right is not always the optimal order for code completion, and generating code in parallel (non-autoregressively) could be a better choice.

Figure 1: The attention density ratio R(P) under different masking probabilities P in code completion and NMT.

Figure 2: The training procedure with the Syntax-Aware Sampling (SAS) strategy.

2.2 Quantitative Dependency among Code Tokens

In this section, we measure the quantitative dependency among the generated tokens by attention density and compare the dependency among code tokens with that among natural language tokens (in the NMT task). The NMT task is selected for comparison because NAR models have been successfully applied to it.

To characterize and quantify the target-token dependency, inspired by Ren et al. (2020), we build a Dependency-Analysis-Model (DAM) to measure the dependency among target tokens as the ratio of the attention weights on the target context over those on the full (both source and target) context when predicting a target token, where a larger ratio indicates a stronger dependency among target tokens. Specifically, we use the masked language modeling task (Devlin et al., 2019) to train DAM. We employ mix-attention (He et al., 2018) to calculate the attention weights, where source tokens can only attend to source tokens with self-attention, while target tokens can attend to all source and target tokens with mix-attention. During training, we randomly mask tokens on the target side, and the model is trained to predict the original values of these tokens. After the model has been trained, we measure the target-token dependency based on the ratio \alpha_{i} of attention weights on the target context over those on the full context when predicting a specific target token y_{i}, which is defined as:

\alpha_{i}=\frac{\frac{1}{N}\sum_{j=1}^{N}A_{i,j}}{\frac{1}{N}\sum_{j=1}^{N}A_{i,j}+\frac{1}{M}\sum_{j=N+1}^{N+M}A_{i,j}}    (2)

where A_{i,j} denotes the attention weight from token i to token j in mix-attention, j\in[1,N] indexes the target tokens, j\in[N+1,N+M] indexes the source tokens, and M and N are the lengths of the source and target inputs, respectively, with \sum_{j=1}^{N+M}A_{i,j}=1. \alpha_{i} represents the ratio of attention density on the target context when predicting target token i. For a given masking probability P, the final attention density ratio R(P) is calculated by averaging \alpha_{i} over all test data. As seen in Figure 1, the attention density ratio for NMT is larger than that for code completion under every masking probability P, which demonstrates that the dependency among target tokens is weaker in code completion than in NMT.
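To make Eq. (2) concrete, the following minimal sketch shows how the attention density ratio could be computed from a mix-attention matrix; the array layout and the function name are our assumptions rather than the paper's implementation:

    import numpy as np

    def attention_density_ratio(A, N, M):
        """alpha_i (Eq. 2) for each target query position i.

        A: (N, N+M) mix-attention matrix; columns 0..N-1 attend to target tokens,
        columns N..N+M-1 attend to source tokens; each row sums to 1.
        """
        target_density = A[:, :N].mean(axis=1)       # (1/N) * sum over target positions
        source_density = A[:, N:N + M].mean(axis=1)  # (1/M) * sum over source positions
        return target_density / (target_density + source_density)

    # Toy check: with uniform attention over all N+M positions, alpha_i = 0.5.
    N, M = 3, 5
    A = np.full((N, N + M), 1.0 / (N + M))
    print(attention_density_ratio(A, N, M))  # [0.5 0.5 0.5]
    # R(P) is then obtained by averaging alpha_i over all masked positions in the test set.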

We conclude based on the preceding analysis that it is potentially practical to predict code tokens in parallel.

3 SANAR

3.1 Overview

In this section, we present our model SANAR in detail. We build SANAR to perform full-line code completion in parallel, using a conditionally independent factorization of the target tokens. Specifically, given the contextual code sequence X=\{x_{1},x_{2},...,x_{m}\}, the non-autoregressive full-line code completion task is to predict a sequence of tokens Y=\{y_{1},y_{2},...,y_{n}\} that forms a complete statement, in a non-autoregressive way:

P(y_{1},...,y_{n}|x_{1},x_{2},...,x_{m})=\prod_{t=1}^{n}P(y_{t}|x_{1},x_{2},...,x_{m})    (3)

Figure 2 shows the architecture and the training procedure of our model. SANAR adopts the encoder-decoder Transformer architecture: a context encoder that performs self-attention, and a generated-code decoder that has one set of attention heads over the encoder's output and another set (self-attention) over the generated code. During training, we apply an adaptive syntax-aware sampling strategy, which enables SANAR, through gradual training, to generate the target code sequence in a single pass at inference time.

3.2 Training

During training, SANAR adopts an adaptive Syntax-Aware Sampling (SAS) strategy to dynamically glance code snippets from the target sequence depending on its difficulty and the tokens' syntax types, aiming to incorporate the explicit syntactic information of the program. To achieve this, SANAR performs decoding twice during training. As shown in Figure 2, the first decoding is performed in a fully-NAR way, where the decoder inputs H=\{h_{1},h_{2},...,h_{n}\} are copied from the encoder output using soft-copy (Wei et al., 2019), which maps the encoder embeddings S=\{s_{1},s_{2},...,s_{m}\} into the target length depending on the distance relationship between source position i and target position j. The initial tokens \hat{Y} are predicted as:

\hat{Y}=f_{dec}(\text{soft-copy}(f_{enc}(X;\theta^{\prime}));\theta)    (4)

The prediction accuracy of this first pass indicates the difficulty of fitting the current target. In the second decoding, SANAR samples tokens from the target as extra decoding input via SAS, according to the first decoding result and the syntax information of the target tokens, and learns to predict the remaining tokens that were not selected. It is important to note that only the second decoding updates the model parameters.

The training objective is to maximize the following:

\mathcal{L}_{SAS}=\sum_{y_{t}\in Y\backslash\mathbb{SAS}(Y,\hat{Y},T)}\log P(y_{t}|X,\mathbb{SAS}(Y,\hat{Y},T);\theta)    (5)

where \hat{Y} is the initial prediction produced by the fully-NAR decoding pass, T is the syntax-type sequence of the target, and \mathbb{SAS}(Y,\hat{Y},T) is the subset of tokens selected via the adaptive SAS strategy.
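To make the two-pass training procedure concrete, the following is a simplified sketch of one training step implementing Eq. (4) and Eq. (5); the module interfaces (encoder, decoder, decoder.embed, soft_copy, and the sampler implementing the SAS strategy of Section 3.3) are our assumptions for illustration, not the authors' released code.

    import torch
    import torch.nn.functional as F

    def sas_training_step(encoder, decoder, soft_copy, sampler, optimizer, X, Y, T):
        """One SANAR-style training step (hypothetical interfaces).

        X: context token ids (batch, m); Y: target line token ids (batch, n);
        T: syntax-type ids of the target tokens (batch, n).
        """
        # First decoding pass: fully NAR, used only to estimate difficulty (Eq. 4).
        with torch.no_grad():
            S = encoder(X)                                # (batch, m, d)
            H = soft_copy(S, target_length=Y.size(1))     # (batch, n, d)
            Y_hat = decoder(H, memory=S).argmax(dim=-1)   # initial prediction

        # Adaptive syntax-aware sampling of ground-truth positions to glance.
        glance_mask = sampler(Y, Y_hat, T)                # (batch, n) bool

        # Second decoding pass: glanced positions receive ground-truth embeddings,
        # the rest are predicted; only this pass updates the parameters (Eq. 5).
        S = encoder(X)
        H = soft_copy(S, target_length=Y.size(1))
        H = torch.where(glance_mask.unsqueeze(-1), decoder.embed(Y), H)
        logits = decoder(H, memory=S)                     # (batch, n, vocab)
        loss = F.cross_entropy(logits[~glance_mask], Y[~glance_mask])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()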

3.3 Syntax-Aware Sampling Strategy

The Syntax-Aware Sampling (SAS) strategy is an important component of SANAR, which adaptively selects positions of tokens from the target sequence depending on its difficulty and the syntax types of the tokens. The selected tokens provide "correct" information from the ground-truth target, and thus reduce the burden on the decoder when predicting the remaining non-selected tokens during training. SAS samples more tokens for SANAR to glance at the beginning of training and then gradually reduces the sampling number, which helps SANAR eventually learn to predict the whole code line in one pass without seeing any ground-truth tokens.

Formally, given the context code sequence X, its predicted token sequence \hat{Y}, the ground truth Y, and the target's syntax-type sequence T, the goal of the SAS strategy \mathbb{SAS}(Y,\hat{Y},T) is to obtain a subset of tokens sampled from Y. Specifically, there are two steps:

1) Deciding the sampling number N depending on the difficulty of correctly generating the target token sequence:

N=\lambda\cdot dis(Y,\hat{Y})    (6)

N is computed from the difference between \hat{Y} and Y, and we adopt the Hamming distance as the metric following Qian et al. (2021): dis(Y,\hat{Y})=\sum_{t=1}^{n}(y_{t}\neq\hat{y}_{t}). \lambda is the sampling ratio, a hyper-parameter that flexibly controls the number of sampled tokens. More tokens are selected and fed as input to the second decoding pass if the network's initial prediction is less accurate. Thus, the sampling number is decided adaptively, considering both the current model's prediction capability and the training sample's complexity.

2) Sampling N tokens from the target sequence. The most direct method is to randomly select N tokens from Y, as in Qian et al. (2021). However, we argue that explicitly considering the syntactic information of the code during token selection is beneficial for understanding the program, and thus helps train the decoder to predict the remaining tokens more precisely. In programs, keywords, identifiers, and operators carry more symbolic and syntactic information than other tokens (for example, literals and separators in Java, such as ';', '{', '}'). Glancing at these tokens is therefore particularly helpful.

To capture the syntactic information of the code during the sampling procedure, we present a Hybrid-Syntax-Guided (HSG) sampling strategy. Specifically, 1-p of the time, the sampling is performed randomly, where tokens are randomly selected from the target sequence; and p of the time, tokens are selected depending on their syntax types: we randomly select K\leq N/2 keywords, I\leq N/4 identifiers, and O\leq N/4 operators from Y, where K+I+O\leq N (the number of such tokens in a target may be smaller than these bounds). Here p is a hyper-parameter controlling the probability of syntax-guided sampling; the best setting is p=30%. The reason for introducing randomness (sampling randomly 1-p of the time) is to enable SANAR to explore more interdependencies among target tokens. Compared with the purely random sampling strategy in Qian et al. (2021), our Hybrid-Syntax-Guided sampling strategy increases the probability of glancing keywords, operators, and identifiers, which helps SANAR capture the program's syntactic and semantic information better.

Finally, we obtain the subset of tokens sampled from Y:

\mathbb{SAS}(Y,\hat{Y},T)=\text{HSG}(Y,N,p),\quad N=\lambda\cdot dis(Y,\hat{Y})    (7)
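A minimal sketch of the SAS strategy in Eq. (6) and Eq. (7) for a single target sequence is given below; the syntax-type ids and the exact capping of K, I, and O are our assumptions, intended as an illustration rather than the authors' implementation.

    import torch

    KEYWORD, IDENTIFIER, OPERATOR = 0, 1, 2   # hypothetical syntax-type ids

    def sas_sample(Y, Y_hat, T, lam=0.3, p=0.3):
        """Return a boolean mask over one target sequence marking glanced tokens.

        Y, Y_hat: (n,) ground-truth / first-pass predicted token ids.
        T: (n,) syntax-type ids of the ground-truth tokens.
        """
        n = Y.size(0)
        # Step 1: adaptive sampling number N = lambda * Hamming(Y, Y_hat)  (Eq. 6).
        N = int(lam * (Y != Y_hat).sum().item())
        mask = torch.zeros(n, dtype=torch.bool)
        if N == 0:
            return mask

        if torch.rand(1).item() < p:
            # Hybrid-Syntax-Guided sampling: up to N/2 keywords, N/4 identifiers,
            # and N/4 operators (fewer if a type does not occur often enough).
            for syntax_type, budget in ((KEYWORD, N // 2), (IDENTIFIER, N // 4), (OPERATOR, N // 4)):
                positions = (T == syntax_type).nonzero(as_tuple=True)[0]
                chosen = positions[torch.randperm(positions.size(0))[:budget]]
                mask[chosen] = True
        else:
            # Purely random sampling of N positions, as in GLAT.
            mask[torch.randperm(n)[:N]] = True
        return mask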

3.4 Inference

As training proceeds and SANAR becomes well tuned, it adaptively reduces the sampling percentage, ensuring that the trained model learns to generate the whole line of code in a single pass. Thus, the inference of SANAR is fully parallel, with only a single decoding pass.

In traditional left-to-right code completion, where the target sequence is predicted token by token, the length of the sequence can be determined by predicting a special EOS (end of sentence) token. However, for NAR-based generation, where the entire sequence is predicted in parallel, the length of the target sequence must be determined in advance. To achieve this, we follow Ghazvininejad et al. (2019) and add an additional [LENGTH] token to the source input; the encoder output for the [LENGTH] token is used to predict the length, formulated as a classification task over the maximum target length. We use 1,000 as the maximum length. The length-prediction loss is added to the cross-entropy loss over the target sequence.
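As an illustration of the length-prediction head described above, here is a minimal sketch assuming the [LENGTH] token is prepended to the source and its encoder state feeds a classifier over lengths up to 1,000; the class and variable names are ours.

    import torch
    import torch.nn as nn

    class LengthPredictor(nn.Module):
        """Predict the target length as a classification over 1..max_len positions."""

        def __init__(self, d_model=512, max_len=1000):
            super().__init__()
            self.proj = nn.Linear(d_model, max_len)

        def forward(self, encoder_out):
            # encoder_out: (batch, src_len, d_model); position 0 is the [LENGTH] token.
            length_state = encoder_out[:, 0, :]
            return self.proj(length_state)            # (batch, max_len) logits

    # During training, the cross-entropy over these logits is added to the
    # token-level loss; at inference, the argmax gives the decoder length.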

4 Experiments

4.1 Experimental Settings

Datasets. We conduct experiments on two program benchmarks crawled from GitHub repositories. Py150 (Raychev et al., 2016) contains 150,000 Python files, split into a training set of 100,000 files and a test set of 50,000 files. GitHub-Java (Allamanis and Sutton, 2013) contains 14,317 projects; we follow Hellendoorn and Devanbu (2017) and Karampatsis et al. (2020) to split the validation and test sets, and randomly sample 1,000 of the remaining projects as the training set.

We use the official Python tokenize library (https://docs.python.org/3/library/tokenize.html) and Javalang (https://github.com/c2nes/javalang) to split Python and Java programs into lines of tokens and to extract their syntax types. We then employ a sliding context window of 10 lines to create the data pairs; that is, the context code sequence contains 10 lines of code tokens, and the next line is taken as the target code sequence. Detailed statistics of the datasets are shown in Table 2.

Table 2: Dataset statistics.
                          Python       Java
# of training pairs       7,531,208    12,993,112
# of testing pairs        3,693,213    671,236
Avg. tokens in context    90.8         71.2
Avg. tokens in target     9.4          6.9
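The data-pair construction described above can be sketched for the Python side as follows, using the standard tokenize module; the filtering of structural tokens and the exact pairing details are our simplifications, not necessarily the authors' preprocessing.

    import io
    import tokenize

    def make_pairs(source_code, context_lines=10):
        """Yield (context, target) token-sequence pairs with a 10-line sliding window."""
        lines = {}  # line number -> list of (token string, syntax-type name)
        for tok in tokenize.generate_tokens(io.StringIO(source_code).readline):
            if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT):
                continue
            lines.setdefault(tok.start[0], []).append((tok.string, tokenize.tok_name[tok.type]))

        line_numbers = sorted(lines)
        for i in range(context_lines, len(line_numbers)):
            context = [t for ln in line_numbers[i - context_lines:i] for t in lines[ln]]
            target = lines[line_numbers[i]]
            yield context, target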

Metrics and Baselines. We use the following metrics to evaluate the performance of our approach:

  • Exact Match Accuracy (EM): We compare the exact matching accuracy between the generated code sequence and the ground truth.

  • BLEU: We use the BLEU score to measure the n-gram similarity between the target code sequence and the generated code sequence.

  • Edit similarity (ES): We use the character-level edit similarity between the predicted output \hat{Y} and the target output Y (see the sketch after this list), which is computed as:

    ES=1-\frac{Lev(\hat{Y},Y)}{|\hat{Y}|+|Y|}    (8)

    where Lev is the Levenshtein distance.

  • Latency: The inference latency is computed as the time to decode a single target code sentence without mini batching, averaged over the whole test set. The decoding is implemented in PyTorch on a single NVIDIA Tesla V100.
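For reference, a small sketch of the edit-similarity metric in Eq. (8), using a plain dynamic-programming Levenshtein distance over characters:

    def levenshtein(a: str, b: str) -> int:
        """Character-level Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def edit_similarity(pred: str, target: str) -> float:
        """ES = 1 - Lev(pred, target) / (|pred| + |target|)   (Eq. 8)."""
        return 1.0 - levenshtein(pred, target) / (len(pred) + len(target))

    print(edit_similarity("return True", "return False"))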

We compare our model with the following baselines, including both a strong AR Transformer and several representative NAR models:

  • Transformer (Vaswani et al., 2017): The base Transformer seq2seq architecture, where the decoder is AR. We also tried the Transformer LM decoder architecture of Wang et al. (2020) with a comparable number of parameters. The results show that the seq2seq architecture outperforms the LM decoder with comparable decoding speed. Thus we use the Transformer (seq2seq) model as the strong AR baseline, and also adopt the seq2seq framework as our base architecture instead of an LM decoder.

  • CMLM (Ghazvininejad et al., 2019): A strong iterative-NAR model, which adopts a multi-pass decoding strategy to refine the target tokens.

  • GLAT (Qian et al., 2021): A strong fully-NAR model, which generates high-quality targets with only a single parallel decoding pass.

Setup. For all the seq2seq baselines and our model, we use a 6-layer encoder and decoder with a model size of 512 (for the Transformer LM (Wang et al., 2020), to keep a comparable number of parameters, we employ a 12-layer decoder with a model size of 512). We set the vocabulary size to 50,000 for both the Python and Java datasets. We train the models with batches of 16k tokens on NVIDIA V100 GPUs. We set the dropout rate to 0.1 and use the Adam optimizer with \beta=(0.9, 0.999). The learning rate warms up to 5e-5 in 4k steps and gradually decays according to the inverse square root of the step number (Vaswani et al., 2017). Following Qian et al. (2021), we set \lambda to 0.3. For the hyper-parameter p in the Hybrid-Syntax-Guided sampling strategy, we use 30% as the final value, which achieves the best results. The syntax types used for guiding the sampling are keyword, operator, and identifier.

Table 3: Results on the Python benchmark.
Model        BLEU   EM     ES     Latency  Speedup
Transformer  21.93  16.24  62.73  121ms    1.0×
CMLM         23.98  13.44  63.82  39ms     3.1×
GLAT         26.78  16.95  66.36  20ms     6.1×
SANAR        29.07  18.35  66.90  20ms     6.1×

Table 4: Results on the Java benchmark.
Model        BLEU   EM     ES     Latency  Speedup
Transformer  25.73  30.33  63.95  101ms    1.0×
CMLM         28.97  28.36  63.66  31ms     3.3×
GLAT         30.56  28.63  65.80  20ms     5.1×
SANAR        32.41  30.57  66.98  19ms     5.3×

4.2 Main Results

The main results on the Python and Java benchmarks are presented in Tables 3 and 4. SANAR significantly improves full-line code completion quality and outperforms all the baselines on all evaluation metrics. In particular, SANAR outperforms the strong AR baseline (Transformer) by a large margin in BLEU. We observe that the improvements in BLEU and Edit Similarity are more significant than those in Exact Match accuracy. As a recommendation tool, a code completion plug-in serves as the developer's assistant, and developers are usually willing to make a few edits to correct the recommended code. Besides, different code snippets can have identical functionality. For example, these two Python return statements have the same functionality: return True if a==0 else False and return False if a!=0 else True. Evaluating the quality of the generated code only by exact match with the ground truth is therefore not convincing enough. BLEU and Edit Similarity, which measure token overlap and the minimum number of edit operations required to transform the predicted code sequence into the target sequence, evaluate the generated code more thoroughly. Thus, the significant improvements in BLEU and ES further demonstrate that SANAR generates code sequences that better meet developers' expectations.

It is interesting to note that most of the NAR models achieve higher BLEU and ES scores than the AR-based Transformer model, and achieve better or comparable EM accuracy on all the datasets. These results are quite different from the NMT task, where NAR models achieve worse or at best comparable results relative to the AR model. The results further verify our assumptions in section 2: the non-autoregressive manner fits code completion better than it fits NMT, and it can boost both the efficiency and the quality of code completion.

Table 5: Speed-up on long target sequences.
Model        Latency  Speedup
Transformer  171ms    1.0×
CMLM         34ms     5.0×
GLAT         19ms     9.0×
SANAR        19ms     9.0×

For inference latency, SANAR significantly reduces the inference time, achieving a 5∼6× speedup compared with the AR-based Transformer model. As a fully-NAR model, SANAR is also faster than the iterative-NAR CMLM model. Compared with NMT, the target sequences in full-line code completion are shorter, so the overall speedup is not as large as in NMT, but it is still significant. We also present the latency comparison for completing lines of ≥10 tokens, which account for about 30% of the test set. The results are shown in Table 5. Since SANAR generates tokens in parallel, its latency is not affected by the target length and remains comparable with the previous results. For the AR model, however, the latency increases substantially as the number of generated tokens grows, and SANAR achieves a 9× speedup. Thus, when completing longer token sequences, the speedup of SANAR is even more significant.

Table 6: Performance of full-line code completion with different sampling probabilities p.
       Python                   Java
p      BLEU   EM     ES         BLEU   EM     ES
0      26.78  16.95  65.94      30.56  28.63  65.80
0.15   28.20  16.96  66.36      32.16  29.94  66.81
0.3    29.07  18.35  66.90      32.41  30.57  66.98
0.5    27.37  17.31  66.30      31.29  29.82  66.43
Table 7: Token repetition ratio results.
Model        Python  Java
Transformer  3.72%   4.36%
CMLM         3.66%   4.92%
GLAT         4.55%   4.56%
SANAR        5.52%   5.40%

4.3 Analysis

Influence of Parameter p.   To analyze how the sampling probability p of our Hybrid-Syntax-Guided sampling strategy affects the model's performance, we conduct experiments with different values of p, where a larger p indicates that more tokens are sampled depending on their syntax types. When p is zero, the sampling becomes totally random, which is the same as GLAT (Qian et al., 2021). The results for different values of p are listed in Table 6. When p is increased from 0 to 0.3, the performance improves, demonstrating the effectiveness of syntax-guided sampling. When p is increased further, so that the randomness decreases, the performance begins to drop, but still outperforms GLAT. This suggests that it is also necessary to introduce randomness during syntax-aware sampling, which helps SANAR explore more interdependencies among target tokens.

Token Repetition Ratio.   In NMT, one known problem of non-autoregressive models is a high token repetition ratio, since the interdependency among target tokens is not considered. We are interested in whether this problem also affects code completion. We measure the percentage of repeated tokens on the test sets of the Python and Java benchmarks. The results are shown in Table 7. As seen from the results, the token repetition ratio of the NAR-based models is not noticeably higher than that of the AR-based Transformer model, which is quite different from NMT. These results also demonstrate that the left-to-right interdependency among target tokens in code completion is much weaker than in NMT. Thus, the non-autoregressive decoder fits code completion better and achieves better performance.
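Since the exact counting scheme is not specified in the text, the following sketch shows one plausible way to measure a token repetition ratio (consecutive duplicates); this definition is our assumption:

    def repetition_ratio(tokens):
        """Fraction of tokens identical to their immediate predecessor."""
        if len(tokens) < 2:
            return 0.0
        repeats = sum(prev == cur for prev, cur in zip(tokens, tokens[1:]))
        return repeats / (len(tokens) - 1)

    print(repetition_ratio(["x", "=", "x", "x", "+", "1"]))  # 0.2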

5 Related Work

5.1 Code Completion

Since Hindle et al. (2016) showed that source code is highly repetitive and predictable, language models have been used for source code modeling and code completion (Hindle et al., 2016; Tu et al., 2014; Hellendoorn and Devanbu, 2017; Li et al., 2018; Liu et al., 2020). Most LM-based code completion models predict only the next token (Li et al., 2018; Liu et al., 2020). Wang et al. (2020) propose to use a Transformer-based LM to perform full-line code completion. In full-line code completion, instead of predicting the next single token, the model predicts a sequence of tokens that forms a complete statement given a partially complete code context. Our experimental results show that the Transformer seq2seq architecture outperforms the LM decoder of Wang et al. (2020) with a comparable number of parameters. Thus, we use the Transformer seq2seq architecture as the AR baseline, and also employ the seq2seq architecture for SANAR instead of a decoder-only model as in Wang et al. (2020). Guo et al. (2021) proposed GRAMMFORMER, a transformer-based grammar-guided code generation model that generates code sketches, i.e., sequences of code tokens with holes. As its generation target differs from ours, we do not compare against it.

5.2 NAR Models of Text Generation

Fully NAR   The fully Non-AutoRegressive model consists of the same encoder as the Transformer and a parallel decoder (Gu et al., 2018). During training, it uses a conditionally independent factorization of the target sentence, maximizing the following likelihood:

\mathcal{L}_{NAR}=\log P(Y|X;\theta)=\sum_{t=1}^{T}\log p(y_{t}|X;\theta)    (9)

During inference, the encoder representation is copied as the input for the decoder, therefore all tokens on the target side can be generated in parallel without depending on the previously generated tokens. Recently, Qian et al. (2021) propose GLAT, which can achieve parallel text generation with only a single decoding pass by gradual training.

Iterative NAR   The conditional independence assumption in fully NAR models does not hold in general, resulting in inferior performance. To improve the fully NAR model, multi-pass iterative decoding approaches such as CMLM (Ghazvininejad et al., 2019) have been proposed, which first predict all of the target words non-autoregressively and then repeatedly mask out and regenerate the words the model is least confident about.

6 Conclusion

In this paper, we propose SANAR, a syntax-aware non-autoregressive model that improves the efficiency and accuracy of full-line code completion. We perform an in-depth analysis of the dependency among target tokens, and the results show that the dependency is weaker than assumed by existing code generation models, suggesting that NAR fits code generation better. Experimental results show that our approach outperforms both NAR and AR baselines and significantly reduces inference time compared with AR-based code completion models. To the best of our knowledge, we are the first to apply a non-autoregressive model to code generation. We believe this work represents a significant advance in code generation and will be beneficial as a building block for many other code generation applications.

References

  • aiXcoder [2018] aiXcoder. aixcoder. https://www.aixcoder.com/, 2018.
  • Allamanis and Sutton [2013] Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In MSR, pages 207–216. IEEE Computer Society, 2013.
  • Copilot [2021] Copilot. Copilot. https://copilot.github.com/, 2021.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics, 2019.
  • Ghazvininejad et al. [2019] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP/IJCNLP (1), pages 6111–6120. Association for Computational Linguistics, 2019.
  • Gu et al. [2018] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. Non-autoregressive neural machine translation. In ICLR (Poster). OpenReview.net, 2018.
  • Guo et al. [2021] Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, and Miltiadis Allamanis. Learning to generate code sketches. arXiv preprint arXiv:2106.10158, 2021.
  • He et al. [2018] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, pages 7955–7965, 2018.
  • Hellendoorn and Devanbu [2017] Vincent J Hellendoorn and Premkumar Devanbu. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 763–773, 2017.
  • Hindle et al. [2016] Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. On the naturalness of software. Communications of the ACM, 59(5):122–131, 2016.
  • Karampatsis et al. [2020] Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, and Andrea Janes. Big code!= big vocabulary: Open-vocabulary models for source code. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pages 1073–1085. IEEE, 2020.
  • Kasai et al. [2020] Jungo Kasai, James Cross, Marjan Ghazvininejad, and Jiatao Gu. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144–5155. PMLR, 2020.
  • Li et al. [2018] Jian Li, Yue Wang, Michael R. Lyu, and Irwin King. Code completion with neural attention and pointer networks. In IJCAI, pages 4159–4165. ijcai.org, 2018.
  • Liu et al. [2020] Fang Liu, Ge Li, Bolin Wei, Xin Xia, Zhiyi Fu, and Zhi Jin. A self-attentional neural architecture for code completion with multi-task learning. In Proceedings of the 28th International Conference on Program Comprehension, pages 37–47, 2020.
  • Ma et al. [2019] Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. Flowseq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP/IJCNLP (1), pages 4281–4291. Association for Computational Linguistics, 2019.
  • Qian et al. [2021] Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. Glancing transformer for non-autoregressive neural machine translation. In ACL/IJCNLP (1), pages 1993–2003. Association for Computational Linguistics, 2021.
  • Raychev et al. [2016] Veselin Raychev, Pavol Bielik, and Martin T. Vechev. Probabilistic model for code with decision trees. In OOPSLA, pages 731–747. ACM, 2016.
  • Ren et al. [2020] Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, and Tie-Yan Liu. A study of non-autoregressive model for sequence generation. In ACL, pages 149–159. Association for Computational Linguistics, 2020.
  • Tabnine [2018] Tabnine. Tabnine. https://www.tabnine.com/, 2018.
  • Tu et al. [2014] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 269–280, 2014.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Wang et al. [2020] Wenhan Wang, Sijie Shen, Ge Li, and Zhi Jin. Towards full-line code completion with neural language models. arXiv preprint arXiv:2009.08603, 2020.
  • Wei et al. [2019] Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. In ACL (1), pages 1304–1312. Association for Computational Linguistics, 2019.