
\ourmodel: multi-mode translation of natural language and \python  code with transformers

Appendix A Appendix

A.1 Docstring statistics

Figure 1 shows the distributions of various features of docstrings in our corpus. The top row shows the distribution of the total character-level length of the method signatures (left), docstrings (center), and code bodies (right). The blue lines are for methods possessing a docstring, and the vast majority of these methods have docstrings with more than 10 characters. The bottom row shows the distribution of the number of lines of the corresponding features from the top row. While the most common docstring line count is 1 (comprising 41% of docstrings), the majority of docstrings span multiple lines.
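Such statistics can be gathered directly from source files with Python's built-in ast module. The sketch below is illustrative only (it approximates the signature as the first line of each definition) and is not the extraction pipeline used to build our corpus.

# Illustrative sketch: compute character- and line-length statistics for the
# signatures, docstrings, and bodies of functions in a Python source string.
# This is not the corpus-extraction pipeline; the signature is approximated
# by the first line of the definition.
import ast

def method_stats(source: str):
    """Yield per-function length statistics from a Python source string."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node) or ""
            lines = segment.splitlines()
            signature = lines[0] if lines else ""
            docstring = ast.get_docstring(node) or ""
            body = "\n".join(lines[1:])
            yield {
                "sig_chars": len(signature),
                "doc_chars": len(docstring),
                "body_chars": len(body),
                "doc_lines": len(docstring.splitlines()),
                "body_lines": len(body.splitlines()),
                "has_docstring": bool(docstring),
            }

if __name__ == "__main__":
    sample = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
    for stats in method_stats(sample):
        print(stats)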

Figure 1: Histogram of the number of characters (top row) in the \python signatures (left), docstrings (middle), and method body (right). The blue lines are for methods with docstrings, the yellow lines are for methods without docstrings. The vast majority of docstrings have more than 10 characters. The bottom row shows histograms of the number of lines for the same features described in the top row.

A.2 Pre-training details

Figure 3 shows the complete training script, using the Facebook AI Research sequence modeling library fairseq, with which we pre-trained \ourmodel. The data was pre-noised and processed using the fairseq-preprocess command, and placed in the directory indicated by $DIR. The architecture and training hyper-parameters are set in this script. \ourmodel  was trained with the same hyperparameters, but with the data described in sec. A.4.

Figure 2 shows learning curves of a single seq2seq model of the same architecture as \ourmodel, trained only on docstrings, starting either from a random initialization or from our pre-trained model. As the figure shows, the pre-trained initialization converged to a better validation loss 25\times faster than the randomly initialized model.

Figure 2: Learning curves for training a sequence-to-sequence transformer, translating from \python method definitions to their docstrings. Blue curves represent the training and validation loss, and show that convergence (validation loss stops decreasing) occurs after 3.97\times 10^{5} steps or 183 epochs. The optimization of the pre-trained model with identical hyperparameters reaches and beats the best validation loss at 1.5\times 10^{4} steps or 7 epochs.
TOTAL_NUM_UPDATES=1300000
WARMUP_UPDATES=5000
LR=9.1875e-05
MAX_TOKENS=2200
UPDATE_FREQ=64
DIR=<data-dir>
fairseq-train $DIR \
    --max-tokens $MAX_TOKENS \
    --task translation \
    --source-lang src --target-lang tgt \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --arch transformer \
    --dropout 0.2 --relu-dropout 0.2 \
    --attention-dropout 0.2 \
    --encoder-embed-dim 1472 \
    --decoder-embed-dim 1472 \
    --max-target-positions 1024 \
    --max-source-positions 1024 \
    --encoder-ffn-embed-dim 4096 \
    --decoder-ffn-embed-dim 4096 \
    --encoder-attention-heads 8 \
    --decoder-attention-heads 8 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr $LR \
    --warmup-updates $WARMUP_UPDATES \
    --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --save-dir $DIR/models \
    --save-interval 16 \
    --fp16 --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 \
    --tensorboard-logdir $DIR/tensorboard \
    --decoder-learned-pos --encoder-learned-pos
Figure 3: The fairseq-train script used to pre-train \ourmodel, setting all the relevant hyper-parameters.
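The learning-rate flags in this script select fairseq's inverse_sqrt scheduler: a linear warmup to the peak learning rate, followed by decay proportional to the inverse square root of the update number. The sketch below is a stand-alone approximation using the script's values, for intuition only; the authoritative behavior is fairseq's implementation.

# Illustrative sketch of an inverse-square-root learning-rate schedule with
# linear warmup, using the peak LR and warmup length from the script above.
# This is a stand-alone approximation, not fairseq's code.

PEAK_LR = 9.1875e-05      # LR in the script
WARMUP_UPDATES = 5000     # WARMUP_UPDATES in the script
WARMUP_INIT_LR = 0.0      # assumed warmup starting point

def learning_rate(num_updates: int) -> float:
    """Return the learning rate after `num_updates` optimizer steps."""
    if num_updates < WARMUP_UPDATES:
        # Linear warmup from WARMUP_INIT_LR up to PEAK_LR.
        return WARMUP_INIT_LR + num_updates * (PEAK_LR - WARMUP_INIT_LR) / WARMUP_UPDATES
    # After warmup, decay proportionally to 1/sqrt(step).
    decay_factor = PEAK_LR * WARMUP_UPDATES ** 0.5
    return decay_factor * num_updates ** -0.5

if __name__ == "__main__":
    for step in (1000, 5000, 20000, 1_300_000):
        print(step, learning_rate(step))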

A.3 GPT2 training details

Our GPT2 experiments also used the fairseq library, with the OpenAI English checkpoint supplied by the HuggingFace library. Figure 4 shows the complete training script; for the English pre-trained initialization, a pre-trained checkpoint was provided. Each model was trained on 4 Tesla V100 GPUs with 16GB of memory each, for 7 days.
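For reference, the English checkpoint in question is the GPT-2 "medium" model distributed through the HuggingFace transformers library (the "gpt2-medium" identifier below is our assumption of the matching checkpoint). The snippet only shows how to inspect those initialization weights; it is not the fairseq-based training setup shown in Figure 4.

# Illustrative only: load the English GPT-2 medium checkpoint from HuggingFace
# to inspect the pre-trained initialization. Not the fairseq training setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)  # greedy decoding by default
print(tokenizer.decode(outputs[0], skip_special_tokens=True))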

fairseq-train $DIR \
    --task language_modeling \
    --optimizer adam \
    --adam-betas "(0.9, 0.98)" \
    --weight-decay 0.01 \
    --clip-norm 0.0 \
    --lr 0.0005 \
    --reset-optimizer \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --warmup-init-lr 1e-07 \
    --dropout 0.1 \
    --weight-decay 0.01 \
    --tokens-per-sample 1024 \
    --sample-break-mode complete \
    --max-tokens 4096 \
    --update-freq 4 \
    --fp16 \
    --arch hf_gpt2_medium \
    --max-target-positions 1024 \
    --skip-invalid-size-inputs-valid-test
Figure 4: The fairseq-train script we used to train our GPT model baselines.

A.4 Multi-mode training details

In order to better teach \ourmodel  to understand the relationships between all the different features of code (signatures, docstrings, and bodies), we taught it to translate between all pairs of feature combinations which do not contain the same feature in both the source and target. In this way, the model can learn to produce method bodies using both signatures and docstrings, or either one alone. Table 1 spells out exactly which combinations were provided to the model as a source and target. To each source example the comment string ‘# target <feature> (<style>)’ was added, instructing the model which feature combination to produce (e.g. signature and body). The style imperative was added only when a docstring was in the target, where the styles are defined and discussed in the main text.
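A minimal sketch of this example construction is shown below: it enumerates the disjoint source/target feature combinations of Table 1 and prepends the control comment to the source text. The helper names and the ‘oneline’ style tag are placeholders (the actual docstring styles are defined in the main text), not the exact format used to build our training data.

# Illustrative sketch of the multi-mode example construction: enumerate the
# disjoint source/target feature combinations of Table 1 and prepend the
# control comment to the source. Helper names and the 'oneline' style tag
# are placeholders.
from itertools import combinations

FEATURES = ("signature", "docstring", "body")

# The six feature combinations of Table 1: three single features, three pairs.
COMBOS = [frozenset(c) for r in (1, 2) for c in combinations(FEATURES, r)]

def translation_pairs():
    """Yield (source, target) combinations sharing no feature (the ✓ cells of Table 1)."""
    for src in COMBOS:
        for tgt in COMBOS:
            if not src & tgt:
                yield src, tgt

def build_source(example, src, tgt, style="oneline"):
    """Concatenate the source features and prepend the target control comment."""
    target_name = " and ".join(f for f in FEATURES if f in tgt)
    # A style imperative is attached only when a docstring is in the target.
    style_suffix = f" ({style})" if "docstring" in tgt else ""
    parts = [example[f] for f in FEATURES if f in src]
    return "\n".join([f"# target {target_name}{style_suffix}", *parts])

if __name__ == "__main__":
    print(sum(1 for _ in translation_pairs()), "translation modes")  # 12, one per ✓ in Table 1
    demo = {
        "signature": "def add(a, b):",
        "docstring": '"""Return the sum of a and b."""',
        "body": "    return a + b",
    }
    print(build_source(demo, src=frozenset({"signature"}), tgt=frozenset({"docstring", "body"})))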

Figure 5 shows the training curves for \ourmodel, where the solid black line is the training loss, and all the other curves are the validation loss for each of the tasks indicated in Table 1. The dashed lines indicate tasks where docstrings are present in the target, showing that these are generally less predictable than code-only targets (as the validation loss is larger). \ourmodel  was trained on 16 Tesla V100 16GB GPUs for 62 epochs, or 5 weeks of training time.

                                      Sources
Targets      | Signature | Docstring | Body | Sig + doc | Sig + body | Doc + body
-------------+-----------+-----------+------+-----------+------------+-----------
Signature    |           |     ✓     |  ✓   |           |            |     ✓
Docstring    |     ✓     |           |  ✓   |           |     ✓      |
Body         |     ✓     |     ✓     |      |     ✓     |            |
Sig + doc    |           |           |  ✓   |           |            |
Sig + body   |           |     ✓     |      |           |            |
Doc + body   |     ✓     |           |      |           |            |

Table 1: All translation possibilities between the 3 features of a function: the signature (sig), docstring (doc), and body. We train our model to translate between the sources and targets indicated with a ✓, which were chosen as all pairs of feature combinations which do not contain the same feature in both the source and target. The system is then instructed to target code bodies when performing function completion.
Figure 5: Learning curve for the multi-mode training, where the black line is the training loss, and the other lines are the validation loss for each mode of translation. Dashed lines indicate the docstrings are in the target, solid lines have only code in the target.