
\ourmodel: multi-mode translation of natural language and \python  code with transformers

Appendix A Appendix

A.1 Docstring statistics

Figure 1 shows the distributions of various features of docstrings in our corpus. The top row shows the distribution of the total character-level length of the method signatures (left), docstrings (center), and code bodies (right). The blue lines are for methods possessing a docstring, and the vast majority of these methods have docstrings with more than 10 characters. The bottom row shows the distribution of the number of lines of the corresponding features from the top row. While the most common docstring line count is 1 (comprising 41% of docstrings), the majority of docstrings span multiple lines.
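Such statistics can be gathered directly from source files with Python's built-in ast module. The sketch below is illustrative only (it approximates the signature as the first line of each definition) and is not the extraction pipeline used to build our corpus.

# Illustrative sketch: compute character- and line-length statistics for the
# signatures, docstrings, and bodies of functions in a Python source string.
# This is not the corpus-extraction pipeline; the signature is approximated
# by the first line of the definition.
import ast

def method_stats(source: str):
    """Yield per-function length statistics from a Python source string."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node) or ""
            lines = segment.splitlines()
            signature = lines[0] if lines else ""
            docstring = ast.get_docstring(node) or ""
            body = "\n".join(lines[1:])
            yield {
                "sig_chars": len(signature),
                "doc_chars": len(docstring),
                "body_chars": len(body),
                "doc_lines": len(docstring.splitlines()),
                "body_lines": len(body.splitlines()),
                "has_docstring": bool(docstring),
            }

if __name__ == "__main__":
    sample = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
    for stats in method_stats(sample):
        print(stats)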

Figure 1: Histogram of the number of characters (top row) in the \python signatures (left), docstrings (middle), and method body (right). The blue lines are for methods with docstrings, the yellow lines are for methods without docstrings. The vast majority of docstrings have more than 10 characters. The bottom row shows histograms of the number of lines for the same features described in the top row.

A.2 Pre-training details

Figure 3 shows the complete training script, using the Facebook AI Research sequence modeling library fairseq, with which we pre-trained \ourmodel. The data was pre-noised and processed using the fairseq-preprocess command, and placed in the directory indicated by $DIR. The architecture and training hyper-parameters are set in this script. \ourmodel  was trained with the same hyperparameters, but with the data described in sec. A.4.

Figure 2 shows learning curves of a single seq2seq model of the same architecture as \ourmodel, trained only on docstrings, starting either from a random initialization or from our pre-trained model. As the figure shows, the pre-trained initialization converged to a better validation loss 25\times faster than the randomly initialized model.

Figure 2: Learning curves for training a sequence-to-sequence transformer, translating from \python method definitions to their docstrings. Blue curves represent the training and validation loss, and show that convergence (validation loss stops decreasing) occurs after 3.97\times 10^{5} steps or 183 epochs. The optimization of the pre-trained model with identical hyperparameters reaches and beats the best validation loss at 1.5\times 10^{4} steps or 7 epochs.
TOTAL_NUM_UPDATES=1300000
WARMUP_UPDATES=5000
LR=9.1875e-05
MAX_TOKENS=2200
UPDATE_FREQ=64
DIR=<data-dir>
fairseq-train $DIR \
    --max-tokens $MAX_TOKENS \
    --task translation \
    --source-lang src --target-lang tgt \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --arch transformer \
    --dropout 0.2 --relu-dropout 0.2 \
    --attention-dropout 0.2 \
    --encoder-embed-dim 1472 \
    --decoder-embed-dim 1472 \
    --max-target-positions 1024 \
    --max-source-positions 1024 \
    --encoder-ffn-embed-dim 4096 \
    --decoder-ffn-embed-dim 4096 \
    --encoder-attention-heads 8 \
    --decoder-attention-heads 8 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr $LR \
    --warmup-updates $WARMUP_UPDATES \
    --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --save-dir $DIR/models \
    --save-interval 16 \
    --fp16 --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 \
    --tensorboard-logdir $DIR/tensorboard \
    --decoder-learned-pos --encoder-learned-pos
Figure 3: The fairseq-train script used to pre-train \ourmodel, setting all the relevant hyper-parameters.
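The learning-rate flags in this script select fairseq's inverse_sqrt scheduler: a linear warmup to the peak learning rate, followed by decay proportional to the inverse square root of the update number. The sketch below is a stand-alone approximation using the script's values, for intuition only; the authoritative behavior is fairseq's implementation.

# Illustrative sketch of an inverse-square-root learning-rate schedule with
# linear warmup, using the peak LR and warmup length from the script above.
# This is a stand-alone approximation, not fairseq's code.

PEAK_LR = 9.1875e-05      # LR in the script
WARMUP_UPDATES = 5000     # WARMUP_UPDATES in the script
WARMUP_INIT_LR = 0.0      # assumed warmup starting point

def learning_rate(num_updates: int) -> float:
    """Return the learning rate after `num_updates` optimizer steps."""
    if num_updates < WARMUP_UPDATES:
        # Linear warmup from WARMUP_INIT_LR up to PEAK_LR.
        return WARMUP_INIT_LR + num_updates * (PEAK_LR - WARMUP_INIT_LR) / WARMUP_UPDATES
    # After warmup, decay proportionally to 1/sqrt(step).
    decay_factor = PEAK_LR * WARMUP_UPDATES ** 0.5
    return decay_factor * num_updates ** -0.5

if __name__ == "__main__":
    for step in (1000, 5000, 20000, 1_300_000):
        print(step, learning_rate(step))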

A.3 GPT2 training details

Our GPT2 experiments also used the fairseq library, with the OpenAI English checkpoint supplied by the HuggingFace library. Figure 4 shows the complete training script; for the English pre-trained initialization, a pre-trained checkpoint was provided. Each model was trained on 4 Tesla V100 GPUs with 16GB of memory each, for 7 days.
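For reference, the English checkpoint in question is the GPT-2 "medium" model distributed through the HuggingFace transformers library (the "gpt2-medium" identifier below is our assumption of the matching checkpoint). The snippet only shows how to inspect those initialization weights; it is not the fairseq-based training setup shown in Figure 4.

# Illustrative only: load the English GPT-2 medium checkpoint from HuggingFace
# to inspect the pre-trained initialization. Not the fairseq training setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)  # greedy decoding by default
print(tokenizer.decode(outputs[0], skip_special_tokens=True))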

fairseq-train $DIR \
    --task language_modeling \
    --optimizer adam \
    --adam-betas "(0.9, 0.98)" \
    --weight-decay 0.01 \
    --clip-norm 0.0 \
    --lr 0.0005 \
    --reset-optimizer \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --warmup-init-lr 1e-07 \
    --dropout 0.1 \
    --weight-decay 0.01 \
    --tokens-per-sample 1024 \
    --sample-break-mode complete \
    --max-tokens 4096 \
    --update-freq 4 \
    --fp16 \
    --arch hf_gpt2_medium \
    --max-target-positions 1024 \
    --skip-invalid-size-inputs-valid-test
Figure 4: The fairseq-train script we used to train our GPT model baselines.

A.4 Multi-mode training details

In order to better teach \ourmodel  to understand the relationships between all the different features of code (signatures, docstrings, and bodies), we taught it to translate between all pairs of feature combinations which do not contain the same feature in both the source and target. In this way, the model can learn to produce method bodies using both signatures and docstrings, or either one alone. Table 1 spells out exactly which combinations were provided to the model as a source and target. To each source example the comment string ‘# target <feature> (<style>)’ was added, instructing the model which feature combination to produce (e.g. signature and body). The style imperative was added only when a docstring was in the target, where the styles are defined and discussed in the main text.
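A minimal sketch of this example construction is shown below: it enumerates the disjoint source/target feature combinations of Table 1 and prepends the control comment to the source text. The helper names and the ‘oneline’ style tag are placeholders (the actual docstring styles are defined in the main text), not the exact format used to build our training data.

# Illustrative sketch of the multi-mode example construction: enumerate the
# disjoint source/target feature combinations of Table 1 and prepend the
# control comment to the source. Helper names and the 'oneline' style tag
# are placeholders.
from itertools import combinations

FEATURES = ("signature", "docstring", "body")

# The six feature combinations of Table 1: three single features, three pairs.
COMBOS = [frozenset(c) for r in (1, 2) for c in combinations(FEATURES, r)]

def translation_pairs():
    """Yield (source, target) combinations sharing no feature (the ✓ cells of Table 1)."""
    for src in COMBOS:
        for tgt in COMBOS:
            if not src & tgt:
                yield src, tgt

def build_source(example, src, tgt, style="oneline"):
    """Concatenate the source features and prepend the target control comment."""
    target_name = " and ".join(f for f in FEATURES if f in tgt)
    # A style imperative is attached only when a docstring is in the target.
    style_suffix = f" ({style})" if "docstring" in tgt else ""
    parts = [example[f] for f in FEATURES if f in src]
    return "\n".join([f"# target {target_name}{style_suffix}", *parts])

if __name__ == "__main__":
    print(sum(1 for _ in translation_pairs()), "translation modes")  # 12, one per ✓ in Table 1
    demo = {
        "signature": "def add(a, b):",
        "docstring": '"""Return the sum of a and b."""',
        "body": "    return a + b",
    }
    print(build_source(demo, src=frozenset({"signature"}), tgt=frozenset({"docstring", "body"})))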

Figure 5 shows the training curves for \ourmodel, where the solid black line is the training loss, and all the other curves are the validation loss for each of the tasks indicated in Table 1. The dashed lines indicate tasks where docstrings are present in the target, showing that these are generally less predictable than code-only targets (as the validation loss is larger). \ourmodel  was trained on 16 Tesla V100 16GB GPUs for 62 epochs, or 5 weeks of training time.

                                      Sources
Targets      | Signature | Docstring | Body | Sig + doc | Sig + body | Doc + body
-------------+-----------+-----------+------+-----------+------------+-----------
Signature    |           |     ✓     |  ✓   |           |            |     ✓
Docstring    |     ✓     |           |  ✓   |           |     ✓      |
Body         |     ✓     |     ✓     |      |     ✓     |            |
Sig + doc    |           |           |  ✓   |           |            |
Sig + body   |           |     ✓     |      |           |            |
Doc + body   |     ✓     |           |      |           |            |

Table 1: All translation possibilities between the 3 features of a function: the signature (sig), docstring (doc), and body. We train our model to translate between the sources and targets indicated with a ✓, which were chosen as all pairs of feature combinations which do not contain the same feature in both the source and target. The system is then instructed to target code bodies when performing function completion.
Figure 5: Learning curve for the multi-mode training, where the black line is the training loss, and the other lines are the validation loss for each mode of translation. Dashed lines indicate the docstrings are in the target, solid lines have only code in the target.