Response Letter for the EMNLP Submission
Updates in Response to the EMNLP Reviewers
We have proofread our previous submission, improved the notation, corrected typos and grammatical errors, and improved the clarity and presentation of the paper. Additionally, we have updated our submission to incorporate helpful suggestions from the EMNLP reviewers. We summarise these changes below:
New Sections
We have added new sections to the main paper and to our appendix, and we have added more details to some existing sections to address the reviewers' concerns:
1. To answer the first reviewer's question, the reasons for choosing the thresholds for selecting the sparse relations have been added to Appendix B.
2. To address the third reviewer's question, a Discussion subsection has been added to Section 4.3 to explain why our model outperforms several baselines.
3. To address the third reviewer's question, a more detailed discussion of each baseline has been added to Appendix D, with justifications for the improvements over each baseline.
Additional Experiments
Following the third reviewer's suggestion on the significance of the LSTM module in the Matching Network, we performed extra ablation studies. We also added ablation studies on the GDELT dataset. The new results are summarized in Table 3. Section 4.4 also contains a more detailed discussion of the different variations of the model, providing a deeper understanding of how our method works.
Model Updates
The third reviewer suggested performing an additional ablation study on the LSTM module (please refer to the reviews included below). After performing this experiment, we realized that our model performs better when a simpler similarity module is used. We therefore removed the LSTM module from the similarity network, which resulted in changes to Section 3.2 and subsequent performance improvements.
Reviews
We list below the original reviews and our responses from the EMNLP rebuttal phase. Please note that most of the changes listed above were performed after the EMNLP rebuttal phase, for the current AAAI submission.
Meta Reviewer
This paper presents an approach for one-shot learning in temporal graphs, where training and test examples for relation prediction tasks on the graph come from different time periods. The choice of problem is relevant, and the model outperforms reasonable baselines in terms of empirical evaluation. However, the reviewers raise concerns about a lack of analysis about why the method is effective, unsubstantiated design choices and some problems with the technical presentation (the authors are right in pointing out that writing style should not be the reason for accepting/rejecting a paper). We suggest that the paper could become much stronger by incorporating these changes.
Answer. We have performed additional ablation studies to address this concern, which also led to improvements in our performance, in addition to providing better insight into why our method performs well.
Reviewer 1
Question. How did you determine the thresholds of 50 and 500 for relations in ICEWS and 50 and 700 for GDELT?
Answer. There is a trade-off in choosing the threshold values. The thresholds for choosing the sparse relations should be selected such that sparsity is preserved while there is still enough data for training the model. We selected the exact threshold values based on prior work for a fair comparison (please see Xiong et al. 2018), which also follows the above rationale. Note that GDELT is less sparse than ICEWS, i.e., there are fewer relations with a frequency between 50 and 500. We increased the upper threshold to increase the number of tasks.
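For reference, a minimal sketch of the thresholding described above; the function and variable names and the simple counting logic are illustrative, not the paper's actual code:

from collections import Counter

def select_sparse_relations(quadruples, low=50, high=500):
    # quadruples: iterable of (subject, relation, object, timestamp) facts.
    # Count how often each relation occurs, then keep only the relations
    # whose frequency lies inside the [low, high] interval (e.g., 50-500
    # for ICEWS, 50-700 for GDELT).
    freq = Counter(rel for _, rel, _, _ in quadruples)
    return {rel for rel, count in freq.items() if low <= count <= high}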
Reviewer 2
Question. Why do you need and as inputs to Att(..)?
Answer. As explained in line 373, the main component of the function Att is a multihead attention sublayer followed by a position-wise sublayer. The multihead attention sublayer can have . The multihead attention + position-wise sublayer can be repeated times. The answer to this question could be inferred from the text, and more information can be found in Vaswani et al. 2017.
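For illustration, a minimal PyTorch sketch of such an Att-style block, assuming a standard transformer encoder layer in the sense of Vaswani et al. 2017; the class name, dimensions, and hyperparameters are illustrative and not the paper's implementation:

import torch.nn as nn

class AttBlock(nn.Module):
    # A multihead attention sublayer followed by a position-wise
    # feed-forward sublayer, stacked n_layers times.
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):          # x: (batch, sequence_length, d_model)
        for layer in self.layers:
            x = layer(x)           # self-attention + position-wise feed-forward
        return x                   # output sequence of the same shape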
Question. If the input to (7) is , how do you get in (8)?
Answer. As described in line 393, the Att function maps an input sequence to a time-aware output sequence . Since the sequence x is defined as , the output sequence can be defined similarly as . We have added the definition of to the text.
Question. How are and related to in (9)?
Answer. For every quadruple in , a score is calculated using (9), which is called . For every quadruple in , there is a corresponding negative quadruple for which is calculated.
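As a purely hypothetical sketch of this pairing, assuming negative quadruples are obtained by corrupting the object entity (a common convention, which may differ from the paper's exact procedure); score_fn stands in for the scoring function of equation (9):

import random

def score_positive_negative(quadruples, entities, score_fn):
    # For each positive quadruple (s, r, o, t), build one negative quadruple
    # by replacing the object with another entity, then score both with the
    # same scoring function.
    pairs = []
    for s, r, o, t in quadruples:
        o_neg = random.choice([e for e in entities if e != o])
        pairs.append((score_fn(s, r, o, t), score_fn(s, r, o_neg, t)))
    return pairs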
Typos and Definitions.
1. is not defined in line 381.
A. Note that is multiplied by , where is the number of heads indicated in line 377.
2. in (4) is not defined.
A. Equation (4) defines , where it appears for the first time. It is used later, in the next section.
3. s is not defined in line 114.
A. In line 163 it is mentioned that is the set of entities, and later in the paragraph we mention that KG completion is to predict a link between the subject (s) and object (o) entities.
4. is overloaded; it is both a set of tasks and a set of sparse relations.
A. The set of sparse relations is in fact the set of tasks in our framework, which is why the same symbol is used for both.
5. , and in (1) are not defined.
A. is the conditional probability parameterized by : given an instance and a support set, what is the probability of .
6. Are and the same?
A. The symbols used in the background section are meant to be more general, just to define the background knowledge. and are intuitively similar, although is specific to our model, as explained in line 239.
7. in (2) is not defined.
A. k has been removed from the paper.
8. Is : in (4) a concatenation operator?
A. It is concatenation; we have clarified this in the text.
9. Figure 1(a) is not referenced in the text.
A. This figure has been removed from the paper due to space limits.
10. What is in (9)?
A. Since we have removed Matching Networks as the similarity metric, this part no longer exists in the paper.
11. and in line 400 are not defined.
A. was a typo; we have fixed it and changed it to .
Items 1-6 could be inferred from the text and did not require major changes to the writing. The rest have been addressed.
Reviewer 3
Question. Despite the better performance of the proposed model, the reasons behind the improvements are not very clear to the readers. It would benefit the audience a lot if the authors could compare the differences between their model and the baselines, and give insights on why these differences lead to better results.
Answer. We have addressed this by adding a Discussion subsection at the end of Section 4.3. We have also added a more detailed discussion of each baseline in the appendix.
Question. Using an LSTM to match the query and the support relation is not well-justified. Why an MLP cannot do the job? This also seems unclear to me.
Answer. We conducted more ablation studies on the LSTM module in the Matching Network. The results showed that although Matching Networks can be effective in some variations of the model with fewer parameters, they actually cause over-parameterization in some other variations. We have added the results of the ablation studies, along with a discussion of these results, to Section 4.4. Based on the results of this study, our full model pipeline no longer uses an LSTM in its architecture.
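To make the comparison concrete, a hedged sketch of the two kinds of similarity modules involved in this ablation: a plain cosine similarity versus an LSTM-based matching step in the style of Matching Networks. This illustrates the general idea only and is not the paper's exact architecture; all names and shapes are illustrative.

import torch.nn as nn
import torch.nn.functional as F

def cosine_similarity_scores(query, support):
    # query: (d,), support: (num_support, d) -> (num_support,) similarities.
    return F.cosine_similarity(query.unsqueeze(0), support, dim=-1)

class LSTMMatcher(nn.Module):
    # Refines the query embedding with an LSTM before computing cosine
    # similarity to the support embeddings; these extra parameters are
    # what the ablation found to be unnecessary in our setting.
    def __init__(self, d):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

    def forward(self, query, support):
        refined, _ = self.lstm(query.view(1, 1, -1))
        return F.cosine_similarity(refined.view(1, -1), support, dim=-1)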