
Dialogue Generation on Infrequent Sentence Functions via Structured Meta-Learning: Supplementary Material

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain

1 Model Settings

We take the most frequent 30k words as our vocabulary and use pretrained embeddings (Song et al., 2018) for initialization. The sentence function embedding, of dimension 20, is randomly initialized and learned during training. We use two-layer LSTMs in both the encoder and the decoder, with a hidden unit size of 400. We apply dropout (Srivastava et al., 2014) with probability p = 0.3. All trainable parameters, except the word embeddings, are randomly initialized from the uniform distribution on (-0.1, 0.1). We adopt teacher forcing during training. At test time, we select the model with the lowest perplexity, and beam search with beam size 5 is employed for generation. All hyper-parameters and models are selected on the validation set.
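The following is a minimal sketch of how these settings could be wired together in a PyTorch-style seq2seq model with a sentence function embedding. The class name, the assumed word embedding dimension of 200, the number of sentence functions, and the way the function embedding is concatenated to decoder inputs are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 30_000      # most frequent 30k words
EMB_DIM = 200            # assumed dimension of the pretrained word embeddings
FUNC_EMB_DIM = 20        # sentence function embedding dimension
HIDDEN_SIZE = 400        # LSTM hidden unit size
NUM_LAYERS = 2           # two-layer LSTMs in encoder and decoder
DROPOUT = 0.3

class Seq2SeqWithSentenceFunction(nn.Module):
    def __init__(self, num_functions: int = 4):
        super().__init__()
        # Word embeddings would be initialized from pretrained vectors in practice.
        self.word_emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        # Randomly initialized sentence function embedding, learned during training.
        self.func_emb = nn.Embedding(num_functions, FUNC_EMB_DIM)
        self.encoder = nn.LSTM(EMB_DIM, HIDDEN_SIZE, NUM_LAYERS,
                               dropout=DROPOUT, batch_first=True)
        self.decoder = nn.LSTM(EMB_DIM + FUNC_EMB_DIM, HIDDEN_SIZE, NUM_LAYERS,
                               dropout=DROPOUT, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)
        self._init_weights()

    def _init_weights(self):
        # All trainable parameters except word embeddings: uniform in (-0.1, 0.1).
        for name, p in self.named_parameters():
            if not name.startswith("word_emb"):
                nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, src, tgt_in, func_id):
        # Teacher forcing: the gold previous tokens `tgt_in` feed the decoder.
        _, state = self.encoder(self.word_emb(src))
        func = self.func_emb(func_id).unsqueeze(1).expand(-1, tgt_in.size(1), -1)
        dec_in = torch.cat([self.word_emb(tgt_in), func], dim=-1)
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)  # logits over the 30k-word vocabulary
```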

2 Learning Settings

We use SGD as the optimizer with a minibatch size of 64 and an initial learning rate of 1.0 for both meta-learning (lines 9 and 11 in Algorithm 1) and multi-task learning. (We also tried Adam (Kingma and Ba, 2015) but found that SGD performed better.) For meta-learning, we sample 3 tasks in line 3 of Algorithm 1 and take a single gradient step in lines 9 and 11 of Algorithm 1. We meta-train the model for 8 epochs and start halving the learning rate after the third epoch. All models are fine-tuned with an SGD optimizer with a minibatch size of 64 and a learning rate of 0.1. We set the gradient norm upper bound to 3 during training and 1 during fine-tuning. To reduce the effect of randomness, we report the average of five runs for all results.
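Below is a minimal sketch of the optimization schedule described above, written as a first-order MAML-style loop. The helpers `sample_tasks` and `compute_loss`, the per-epoch halving schedule, and the first-order meta-update are our own assumptions for illustration and do not reproduce Algorithm 1 exactly.

```python
import copy
import torch

META_EPOCHS = 8
BATCH_SIZE = 64
TASKS_PER_STEP = 3        # tasks sampled per meta-iteration (Algorithm 1, line 3)
INNER_LR = 1.0            # SGD learning rate for the inner step (line 9)
OUTER_LR = 1.0            # SGD learning rate for the meta-update (line 11)
GRAD_CLIP_TRAIN = 3.0     # gradient norm upper bound during training

def meta_train(model, sample_tasks, compute_loss):
    meta_opt = torch.optim.SGD(model.parameters(), lr=OUTER_LR)
    for epoch in range(META_EPOCHS):
        # Learning rate starts being halved after the third epoch
        # (halving once per epoch here is an assumption).
        if epoch >= 3:
            for group in meta_opt.param_groups:
                group["lr"] *= 0.5
        for support, query in sample_tasks(TASKS_PER_STEP, BATCH_SIZE):
            # Inner loop: a single SGD step on the support set of the task.
            learner = copy.deepcopy(model)
            inner_opt = torch.optim.SGD(learner.parameters(), lr=INNER_LR)
            inner_opt.zero_grad()
            compute_loss(learner, support).backward()
            torch.nn.utils.clip_grad_norm_(learner.parameters(), GRAD_CLIP_TRAIN)
            inner_opt.step()
            # Outer loop (first-order approximation): query-set gradients of the
            # adapted learner are copied back onto the original parameters.
            meta_opt.zero_grad()
            compute_loss(learner, query).backward()
            for p, lp in zip(model.parameters(), learner.parameters()):
                p.grad = None if lp.grad is None else lp.grad.clone()
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_TRAIN)
            meta_opt.step()
```

Fine-tuning on a target sentence function would follow the same pattern with a plain SGD loop, a learning rate of 0.1, and a gradient norm bound of 1.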