
COARSE-TO-CAREFUL: SEEKING SEMANTIC-RELATED KNOWLEDGE FOR OPEN-DOMAIN COMMONSENSE QUESTION ANSWERING

Abstract

It is common to leverage external knowledge to help machines answer questions that require background commonsense, but unconstrained knowledge retrieval transmits noisy and misleading information. To address the issue of introducing related knowledge, we propose a semantic-driven knowledge-aware QA framework that controls knowledge injection in a coarse-to-careful fashion. At the knowledge extraction stage, we devise a tailoring strategy that filters extracted knowledge under the supervision of the coarse semantics of the question. We further develop a semantic-aware knowledge fetching module that exploits structural knowledge and fuses appropriate knowledge according to the fine-grained semantics of the question in a hierarchical way. Experiments demonstrate that the proposed approach improves performance on the CommonsenseQA dataset compared with strong baselines.

Index Terms—  Machine Reading Comprehension, Commonsense Knowledge

1 Introduction

Open-domain CommonSense Question Answering (OpenCSQA) is a challenging research topic in AI that evaluates whether a machine can give the correct answer by manipulating knowledge the way human beings do. Questions in OpenCSQA must be answered with the support of rich background commonsense knowledge. Although it is trivial for humans to solve such questions from the question text alone, it remains difficult for machines.

It is notable that emerging Pre-trained Language Models (PLMs), such as BERT [1], achieve remarkable success on a variety of QA tasks. The outstanding performance of PLMs benefits from large-scale textual corpora [1, 2]. However, commonsense knowledge is often not available in textual form [3], and PLMs struggle to answer questions that require commonsense beyond the textual information given in the question. Specifically, they make judgments with their implicitly encapsulated knowledge and tend to emphasize textual matching between words in the question and the answer [4].

There have been several attempts to leverage external structured knowledge for OpenCSQA. Some incorporate only entities [5] or relations between words [6, 7] from the knowledge base while ignoring structural information. Because the OpenCSQA task provides no background document, it is hard to bound the knowledge to be grounded. Most works search and rank the selected knowledge triples with a heuristic occurrence score [8, 9] or the score function of knowledge representation learning [10]. Moreover, previous works pay no attention to the contextual semantic relevance between the extracted knowledge and the question. Massive external knowledge brings in noisy and misleading information, and consequently leads the model away from the content of the current question. How to introduce knowledge semantically related to the question therefore remains substantially challenging.

To address this issue, we propose a Semantic-driven Knowledge-aware Question Answering (SEEK-QA) framework that manipulates the injection of relevant knowledge triples in a coarse-to-careful fashion. The advantages of the proposed framework are twofold. First, we design a Semantic Monitoring Knowledge Tailoring (SONAR) strategy that constrains the selected knowledge triples with the global, coarse semantics of the question. It not only removes irrelevant knowledge but also improves the computational efficiency of the follow-up model. Second, we develop a Semantic-aware Knowledge Fetching (SKETCH) module, which capitalizes on the fine-grained semantics of the question to measure the semantic relevance between the question and knowledge triples in a hierarchical way.

In this paper, we focus on the typical multiple-choice question answering benchmark CommonsenseQA [11] and use the structured knowledge base ConceptNet [12] as the external knowledge source. Experimental results demonstrate that SONAR and SKETCH boost performance compared with strong baselines, and we illustrate how the injected knowledge influences the model's judgment.

2 Overview of SEEK-QA

In the CommonsenseQA task, given a question $\mathbf{Q}$ and a set of candidate answers $\mathcal{A}$, the model $\mathcal{M}$ is asked to select the single correct answer $\mathbf{A}^{\ast}$ from the candidate set. When involving external knowledge $\mathcal{K}$, the goal of the model can be formalized as $\mathbf{A}^{\ast}=\arg\max_{A_i\in\mathcal{A}}\mathcal{M}(A_i|\mathbf{Q},\mathcal{K}_i)$, where $\mathcal{K}_i$ stands for the requisite knowledge extracted from ConceptNet for the $i$-th candidate answer. Each piece of knowledge in ConceptNet is a triple $k=(cn^{head}, r, cn^{tail})$, where $cn^{head}$ and $cn^{tail}$ are concepts and $r$ is the relation between them. We denote the extracted knowledge graph of $\mathcal{K}_i$ as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, where $\mathcal{V}$ represents concepts and $\mathcal{E}$ stands for relations.

Our SEEK-QA framework, as shown in Fig. 1, follows the retrieve-then-answering paradigm and contains two phases: a) retrieve: using the SONAR strategy (Section 3) to filter out irrelevant knowledge triples; b) answer: using a QA model equipped with the SKETCH module to select the correct answer for a given question.

The main QA model consists of three stages:

Contextual Encoding: We utilize a PLM as the contextual encoder. It takes the question $\mathbf{Q}$ and each candidate answer $A_i$, formatted as $[\texttt{CLS}]\,\mathbf{Q}\,[\texttt{SEP}]\,A_i\,[\texttt{SEP}]$, as input. The contextual representation of the $t$-th token is denoted as $h_t^c\in\mathbb{R}^{d_h}$, where $d_h$ is the hidden dimension of the PLM encoder. We treat the output at $[\texttt{CLS}]$ ($h_0^c\in\mathbb{R}^{d_h}$) as the global semantic representation of the question-answer pair.
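As a concrete illustration, a minimal sketch of this encoding step is given below. It assumes the HuggingFace transformers library and the roberta-large checkpoint; the paper does not specify its implementation, so the names here are our assumptions.

```python
# Sketch of the contextual encoding step, assuming HuggingFace `transformers`.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

question = "Where would you find a seafood restaurant?"   # illustrative example
candidate = "city"

# Builds "<s> Q </s></s> A </s>" for RoBERTa, the analogue of the paper's
# [CLS] Q [SEP] A [SEP] BERT-style notation.
inputs = tokenizer(question, candidate, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d_h)

h_c = hidden[0]      # contextual token representations h_t^c
h0_c = hidden[0, 0]  # [CLS]/<s> output: global semantic representation h_0^c
```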

Knowledge Module: The knowledge module, SKETCH, is the core of the QA model and is elaborated in Section 4. It is responsible for encoding and fusing knowledge and for generating the integrated information $I$ used in the final prediction.

Answer Scoring: The last step calculates the final score for each candidate answer as $\mbox{score}(A_i|\mathbf{Q},\mathcal{K}_i)=\mbox{MLP}(I)$, where $I$ is defined in Section 4.2. The probability of candidate answer $A_i$ being selected is $\mathbf{P}(A_i|\mathbf{Q},\mathcal{K}_i)=\frac{\exp(\mbox{score}(A_i))}{\sum_{i'=1}^{|\mathcal{A}|}\exp(\mbox{score}(A_{i'}))}$.

Here, we use cross entropy to compute the loss between the model's predictions and the ground-truth answer labels.
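For clarity, the sketch below shows how the scoring, softmax, and loss fit together. It assumes each candidate's fused representation $I$ is a fixed-size vector, and the one-hidden-layer MLP is an illustrative assumption (the paper does not give the MLP's architecture).

```python
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    """Scores each candidate from its fused representation I (Section 4.2)."""
    def __init__(self, d_in: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.Tanh(), nn.Linear(d_hidden, 1)
        )

    def forward(self, I: torch.Tensor) -> torch.Tensor:
        # I: (num_candidates, d_in) -> logits: (num_candidates,)
        return self.mlp(I).squeeze(-1)

scorer = AnswerScorer(d_in=1024)
I = torch.randn(5, 1024)               # one fused vector per candidate answer
logits = scorer(I)
probs = torch.softmax(logits, dim=-1)  # P(A_i | Q, K_i)

# Cross entropy between the prediction distribution and the gold answer index.
gold = torch.tensor(2)
loss = nn.functional.cross_entropy(logits.unsqueeze(0), gold.unsqueeze(0))
```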

Fig. 1: Workflow of Semantic-driven Knowledge-aware Question Answering (SEEK-QA) framework.

3 Semantic Monitoring Knowledge Tailoring

In this section, we present the SONAR strategy for knowledge selection and filtering.

During knowledge selection, we first convert the question and the answer into lists of concepts, denoted as $\mathcal{C}n^q$ and $\mathcal{C}n^a$, respectively. We denote the $m$-th concept in the question as $cn^q_m\in\mathcal{C}n^q$ and the $n$-th concept in the answer as $cn^a_n\in\mathcal{C}n^a$. Next, we extract the link paths between question concepts and answer concepts with a maximal hop of 3. Meanwhile, we extract link paths between arbitrary pairs of question concepts. After knowledge selection, we obtain multiple groups of link paths between each pair of concepts $\langle cn^q_m, cn^a_n\rangle$.
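The paper does not give the extraction procedure in code. One plausible realization, assuming ConceptNet has been loaded into a networkx MultiDiGraph whose edges carry a "rel" attribute (both assumptions of ours), is a bounded-depth simple-path search:

```python
# Hypothetical sketch of link-path extraction over ConceptNet; `cn` and the
# "rel" edge attribute are illustrative names, not from the paper.
import networkx as nx

def link_paths(cn: nx.MultiDiGraph, src: str, dst: str, max_hop: int = 3):
    """All simple paths src -> dst of at most `max_hop` edges, as triples."""
    paths = []
    for node_path in nx.all_simple_paths(cn, src, dst, cutoff=max_hop):
        triples = []
        for u, v in zip(node_path, node_path[1:]):
            rel = next(iter(cn[u][v].values()))["rel"]  # pick one parallel edge
            triples.append((u, rel, v))
        paths.append(triples)
    return paths
```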

During knowledge filtration, we first run a knowledge embedding algorithm, TransE [13], on ConceptNet to obtain embeddings for concepts and relations. SONAR then rates link paths from three aspects. (1) On the link aspect, we follow the operation in [10], which defines the score of a link path as the product of the validity scores of its triples. (2) On the concept aspect, SONAR represents the question with GloVe embeddings mean-pooled over the sentence length, and represents each concept with its GloVe embedding; it uses cosine similarity to measure the semantic relevance between the question representation and each concept representation, and defines the link path score as the average over all concepts' scores. (3) On the relation aspect, SONAR performs the same operation as for concepts, with each relation represented by its embedding from the knowledge embedding algorithm. Finally, SONAR preserves a link path only if it satisfies the thresholds on at least two of the three scores, and removes it otherwise. After filtering, the reserved link paths are gathered into a knowledge sub-graph $\mathcal{G}$ that is taken as input to the follow-up QA model.
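A sketch of the filtration rule follows, under stated assumptions: triple_score stands in for the validity score of [10] (not reimplemented here), and glove and transe_rel are lookup tables we assume have been precomputed.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sonar_keep(path, q_vec, glove, transe_rel, triple_score,
               thresholds=(0.15, 0.3, 0.35)):
    """Keep a link path iff it passes at least two of the three thresholds.

    path:         list of (head, rel, tail) triples
    q_vec:        mean-pooled GloVe vector of the question
    glove:        dict concept -> GloVe vector (assumed precomputed)
    transe_rel:   dict relation -> TransE relation embedding
    triple_score: callable (head, rel, tail) -> validity score, as in [10]
    """
    # (1) link aspect: product of per-triple validity scores
    link = float(np.prod([triple_score(h, r, t) for h, r, t in path]))
    # (2) concept aspect: mean question-concept cosine similarity
    concepts = {c for h, _, t in path for c in (h, t)}
    concept = float(np.mean([cosine(q_vec, glove[c]) for c in concepts]))
    # (3) relation aspect: mean question-relation cosine similarity
    relation = float(np.mean([cosine(q_vec, transe_rel[r]) for _, r, _ in path]))
    passed = sum(s >= th for s, th in zip((link, concept, relation), thresholds))
    return passed >= 2
```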

4 Semantic-aware Knowledge Fetching

In this section, we elaborate on the workflow of the knowledge module, SKETCH, over the acquired knowledge.

4.1 Knowledge Encoding

Considering that structural relations offer useful semantic information, we employ a relation-aware graph attention network [14, 15] to calculate graph-level representations as:

$\boldsymbol{h}_{j}^{g}=\sum_{j'=1}^{|Nb_j|}\boldsymbol{\alpha}_{j'}\,[\hat{\boldsymbol{h}}_{j}^{g};\hat{\boldsymbol{h}}_{j'}^{g}],\qquad \boldsymbol{\alpha}_{j}=\frac{\exp(\boldsymbol{\beta}_{j})}{\sum_{j'=1}^{|Nb_j|}\exp(\boldsymbol{\beta}_{j'})}$    (1)

$\boldsymbol{\beta}_{j}=(W_r\boldsymbol{r}_{j})^{\top}\tanh\!\left(W_1\hat{\boldsymbol{h}}_{j}^{g}+W_2\hat{\boldsymbol{h}}_{j'}^{g}\right)$    (2)

where $Nb_j$ is the set of neighbor nodes of the $j$-th node, $\hat{\boldsymbol{h}}_{*}^{g}$ and $\boldsymbol{h}_{*}^{g}\in\mathbb{R}^{d_g}$ are concept representations, $\boldsymbol{r}_j\in\mathbb{R}^{d_r}$ is the relation representation, and $W_r$, $W_1$, $W_2$ are trainable matrices.
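A compact PyTorch sketch of one such layer, following Eqs. (1)-(2), is shown below. The final output projection that maps the concatenated pair back to dimension $d_g$ is our own implementation assumption so that layers can be stacked; the paper does not specify this detail.

```python
import torch
import torch.nn as nn

class RelAwareGATLayer(nn.Module):
    """One relation-aware graph attention layer, following Eqs. (1)-(2)."""
    def __init__(self, d_g: int, d_r: int):
        super().__init__()
        self.W_r = nn.Linear(d_r, d_g, bias=False)
        self.W_1 = nn.Linear(d_g, d_g, bias=False)
        self.W_2 = nn.Linear(d_g, d_g, bias=False)
        self.out = nn.Linear(2 * d_g, d_g)  # assumed: project [h_j; h_j'] to d_g

    def forward(self, h_j, h_nb, r_nb):
        # h_j:  (d_g,)       center concept representation
        # h_nb: (|Nb|, d_g)  neighbor concept representations
        # r_nb: (|Nb|, d_r)  relations to each neighbor
        beta = (self.W_r(r_nb) *
                torch.tanh(self.W_1(h_j) + self.W_2(h_nb))).sum(-1)  # Eq. (2)
        alpha = torch.softmax(beta, dim=0)                           # Eq. (1)
        pair = torch.cat([h_j.expand_as(h_nb), h_nb], dim=-1)        # [h_j; h_j']
        return self.out((alpha.unsqueeze(-1) * pair).sum(0))
```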

Each link path $l_k$ is a sequence of concatenated triples: $cn_1 \xrightarrow{r_1} cn_2 \cdots \xrightarrow{r_{|l_k|-1}} cn_{|l_k|}$. We employ a heuristic method to encode the $k$-th link path with a bidirectional GRU [16]: $h_{k_j}^l=\mbox{BiGRU}([h_j^g; r_j; h_{j+1}^g])$. Then we compute a single vector for the link path by mean pooling over its sequential representation, $u_k=\mbox{mean}(h_k^l)$, which is taken as the knowledge representation of the $k$-th link path.
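A minimal sketch of this path encoder, with the hidden size of 150 stated in Section 5.2:

```python
import torch
import torch.nn as nn

class LinkPathEncoder(nn.Module):
    """Encodes a link path as mean-pooled BiGRU states over its triples."""
    def __init__(self, d_g: int, d_r: int, d_hidden: int = 150):
        super().__init__()
        # Each step consumes the concatenation [h_j^g; r_j; h_{j+1}^g].
        self.bigru = nn.GRU(2 * d_g + d_r, d_hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, h_g: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # h_g: (path_len, d_g) concept reps; r: (path_len - 1, d_r) relations
        steps = torch.cat([h_g[:-1], r, h_g[1:]], dim=-1)  # one step per triple
        out, _ = self.bigru(steps.unsqueeze(0))            # (1, T, 2*d_hidden)
        return out.mean(dim=1).squeeze(0)                  # u_k: mean pooling
```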

4.2 Knowledge Fusion

During the fusion phase, SKETCH applies a semantic-aware progressive knowledge fusion mechanism, as shown in Fig. 2, to integrate relevant knowledge, considering that different pairs of concepts contribute differently to the question.

We treat the link paths containing the same $k'$-th concept pair $\langle cn^q_m, cn^a_n\rangle$ as a link group $O_{k'}$. For a group of link paths, we calculate the semantic link strength, which reflects the semantic relevance between a link path $l_k\in O_{k'}$ and the concept pair $\langle cn^q_m, cn^a_n\rangle$: $\boldsymbol{\alpha}_k^l=(W_3[h_m^c; h_n^c])^{\top}(W_4 u_k)$, where $h_m^c$ is the contextual-encoder representation of $cn_m^q$, $h_n^c$ is that of $cn_n^a$, and $W_3$, $W_4$ are trainable matrices. The semantic link strength $\boldsymbol{\alpha}^l$ is then normalized within its group to assemble the representation of link group $O_{k'}$ as follows:

$U_{k'}=\sum_{k=1}^{|O_{k'}|}\boldsymbol{\beta}_k^l u_k,\qquad \boldsymbol{\beta}_k^l=\frac{\exp(\boldsymbol{\alpha}_k^l)}{\sum_{l_s\in O_{k'}}\exp(\boldsymbol{\alpha}_s^l)}$    (3)

Across different pairs of concepts, the semantic union strength is designed to capture the semantic relevance between a pair of concepts and the global question semantics, which expresses how much the concept pair contributes to the question. The semantic union strength is calculated as follows:

$\boldsymbol{\alpha}_{k'}^c=(W_5 h_0^c)^{\top}(W_6[h_m^g; h_n^g; h_m^c; h_n^c])$    (4)

where $h_m^g$ and $h_n^g$ are the graph-level representations of $cn_m^q$ and $cn_n^a$, respectively, and $W_5$, $W_6$ are trainable weight matrices. The semantic union strength $\boldsymbol{\alpha}^c$ is then normalized as $\boldsymbol{\beta}_{k'}^c=\exp(\boldsymbol{\alpha}_{k'}^c)/\sum_{s'=1}^{|\langle\mathcal{C}n^q,\mathcal{C}n^a\rangle|}\exp(\boldsymbol{\alpha}_{s'}^c)$.

Combining the semantic union strength and the semantic link strength, we obtain the final semantic-aware knowledge representation $V^k=\sum_{k'=1}^{|\langle\mathcal{C}n^q,\mathcal{C}n^a\rangle|}\boldsymbol{\beta}_{k'}^c\cdot F([h_m^g; h_n^g; U_{k'}])$, where $V^k\in\mathbb{R}^{d_k}$ and $F(\cdot)$ is a 1-layer feed-forward network.
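Putting Eqs. (3)-(4) together, a sketch of the hierarchical fusion follows. The projections W3-W6 and F are assumed to be nn.Linear layers with the obvious input sizes, and the loop over link groups is one way to organize the computation; the paper does not specify how it is batched.

```python
import torch

def fuse_knowledge(h0_c, hc_pairs, hg_pairs, u_groups, W3, W4, W5, W6, F):
    """Hierarchical fusion sketch for Eqs. (3)-(4) and the final V^k.

    h0_c:     (d_h,)      global semantic representation
    hc_pairs: (P, 2*d_h)  [h_m^c; h_n^c] for each of P concept pairs
    hg_pairs: (P, 2*d_g)  [h_m^g; h_n^g] for each concept pair
    u_groups: list of P tensors, each (|O_k'|, d_u) link-path reps u_k
    """
    group_reps = []
    for k, u in enumerate(u_groups):
        alpha_l = (W3(hc_pairs[k]) * W4(u)).sum(-1)  # semantic link strength
        beta_l = torch.softmax(alpha_l, dim=0)       # Eq. (3) normalization
        group_reps.append((beta_l.unsqueeze(-1) * u).sum(0))  # U_k'
    U = torch.stack(group_reps)                      # (P, d_u)

    pair_feats = torch.cat([hg_pairs, hc_pairs], dim=-1)
    alpha_c = (W5(h0_c) * W6(pair_feats)).sum(-1)    # Eq. (4) union strength
    beta_c = torch.softmax(alpha_c, dim=0)

    fused = F(torch.cat([hg_pairs, U], dim=-1))      # F([h_m^g; h_n^g; U_k'])
    return (beta_c.unsqueeze(-1) * fused).sum(0)     # V^k
```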

For candidate answer $A_i$, we take its graph-level representation $h_n^g$ as the knowledgeable representation $V^a$. If a candidate answer contains more than one concept, we mean-pool the graph-level representations of its concepts.

Finally, SKETCH employs a selective gate, which gathers the semantic-aware knowledge representation, the graph-level knowledgeable representation of the candidate answer, and the global question semantic representation, to construct the final output for the answer scoring module:

$V=F([V^k; V^a]),\qquad z=\sigma(W_z[h_0^c; V])$    (5)

$I=z\cdot F(h_0^c)+(1-z)\cdot V$    (6)

where $W_z$ is a trainable matrix. The gate $z$ controls the selective merging of information from external knowledge.
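A sketch of the gate, under the assumption that $F(\cdot)$ in Eqs. (5)-(6) denotes two separate one-layer feed-forward networks mapping to a common output size (the dimensions below are illustrative):

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    """Gated merge of contextual and knowledge signals, Eqs. (5)-(6)."""
    def __init__(self, d_h: int, d_vk: int, d_va: int, d_out: int):
        super().__init__()
        self.F_v = nn.Linear(d_vk + d_va, d_out)  # V = F([V^k; V^a])
        self.F_c = nn.Linear(d_h, d_out)          # F(h_0^c)
        self.W_z = nn.Linear(d_h + d_out, d_out)

    def forward(self, h0_c, V_k, V_a):
        V = self.F_v(torch.cat([V_k, V_a], dim=-1))
        z = torch.sigmoid(self.W_z(torch.cat([h0_c, V], dim=-1)))  # Eq. (5)
        return z * self.F_c(h0_c) + (1 - z) * V   # Eq. (6): integrated info I

gate = SelectiveGate(d_h=1024, d_vk=100, d_va=100, d_out=1024)
I = gate(torch.randn(1024), torch.randn(100), torch.randn(100))
```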

Fig. 2: Progressive knowledge fusion mechanism inside SKETCH. GS: global semantic representation. Blue circles represent concepts in the question, while pink circles correspond to concepts in the answer. Red arrow lines indicate link strength: the thicker the line, the higher the strength. Purple rectangles indicate union strength: the darker the rectangle, the higher the strength.

5 Experiments

5.1 Dataset and Knowledge Source

We evaluate our proposed approach on the CommonsenseQA dataset [11], a multiple-choice question answering dataset in which each example consists of 1 question and 5 candidate answers. Because the test set is not public, we randomly split the official development set into an inner development set (60%) and an inner test set (40%) for model selection and ablation studies. Results are evaluated with accuracy. We use ConceptNet as the external knowledge source. After removing non-English concepts, there are 2,487,809 triples with 799,272 concepts and 40 relations. These triples are taken as input to the knowledge extraction (SONAR). The SONAR thresholds for link/concept/relation are set to 0.15/0.3/0.35, and the max hop is set to 2. Refer to Table 1 for detailed statistics of the obtained link paths.

Num.       Orig. 3    Hop 3     Hop 2    Hop 1
Total L    41148k     15634k    1547k    52k
Avg. L1    845        321       32       1
Avg. CP    10         10        10       10
Avg. L2    84         31        3        <1

Table 1: Statistics of the extracted link paths with SONAR under different max hops. L: link paths. CP: number of concept pairs per QA pair. L1/L2: number of links per QA pair / per pair of concepts. Orig. 3: link paths without filtering.

5.2 Experiment Setups

We implement our approach with the PyTorch [17] framework and employ RoBERTa-large [2] as the contextual encoder. The dimension of concept and relation embeddings is set to 100. The graph attention network has 2 layers, both with a hidden size of 100. The hidden size of the one-layer BiGRU for link path encoding is 150. The max length of textual input is 100. We adopt the Adam [18] optimizer with an initial learning rate of 0.00001 and train the model for 1200 steps with a batch size of 24.

Group   Model            Acc. (%)
1       BERT-large       63.64
        BERT-wwm         65.40
        RoBERTa-large    71.44
2       KagNet [10]      64.46
ours    SEEK-QA          74.52

Table 2: The performance on CommonsenseQA.

5.3 Results

The main results of the baseline models and ours on CommonsenseQA are shown in Table 2. We compare our model with two groups of models on the CommonsenseQA task. Group 1: models with internally encapsulated knowledge, i.e., PLMs such as GPT [19] and BERT [1]. Group 2: models with explicit structured knowledge following the retrieve-then-answering paradigm. Our SEEK-QA achieves a promising improvement over the baselines. Against Group 1, our approach improves the performance of PLMs by introducing appropriate knowledge; against Group 2, the results also show a clear advantage.

6 Analysis and Discussion

Acc. (%)      Hop=3    Hop=2    Hop=1
SONAR         73.21    74.52    75.10
w/o SC        72.48    72.64    73.28
w/o filter    –        73.62    75.14

Table 3: Ablation studies on the knowledge range on the dev set. SC: semantic constraint scores of concept/relation in Section 3.
Acc. (%)      Hop=2              Hop=1
              sDev     sTest     sDev     sTest
SKETCH        74.04    76.27     74.59    75.86
GATL=1        70.90    74.84     73.90    75.01
GATL=3        72.95    75.25     71.72    72.59
w/o GAT       70.49    71.98     72.54    73.41
w/o SLS       72.40    75.25     71.17    73.41
w/o SUS       70.49    73.00     71.72    73.64

Table 4: Ablation studies on the SKETCH model. GATL: layers of the Graph Attention network (GAT). SLS: Semantic Link Strength. SUS: Semantic Union Strength.

(1) Ablation on the Range of External Knowledge: We compare the impact of the range of introduced knowledge when extracting with the SONAR strategy versus filtering knowledge without contextual semantic constraints. As shown in Table 3, filtering knowledge triples with SONAR yields better performance than removing the semantic constraint. We argue that reducing the input of irrelevant, noisy knowledge makes the QA model more focused. Compared with removing filtering altogether (i.e., taking all extracted knowledge as input to the QA model), the SONAR strategy also shows an advantage over a wider range of knowledge (Hop=2, 1). This reveals that SONAR helps attract more suitable knowledge for the task and benefits task performance.

(2) Ablation on SKETCH Components: As shown in Table 4, the performance first increases and then decays as GATL grows, and drops when the GAT knowledge encoding is removed. We can assume that GAT facilitates the model in gathering information from structured knowledge. When the SLS and SUS operations are removed, respectively, performance declines in both cases, indicating that the semantic relevance strengths help distinguish the worth of knowledge triples.

(3) Case Study: As shown in Fig. 3, RoBERTa fails on these cases, while our approach makes correct predictions with the support of closely related knowledge links. Notably, the number of links is greatly reduced by the SONAR strategy while core knowledge links are retained. With such requisite knowledge as input, the SKETCH model figures out the correct answers with high confidence.

Fig. 3: Case study. Kn. contains the knowledge selected through SONAR and the numbers of retained/original links for the QA pair. The correct answer is in bold pink.

7 Conclusion

In this work, we propose the Semantic-driven Knowledge-aware Question Answering (SEEK-QA) framework, which manipulates the injection of external structured knowledge in a coarse-to-careful fashion. Experimental results demonstrate the effectiveness of the proposed approach.

Acknowledgments

We thank the reviewers for their advice. This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFB0801003.

References

  • [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT. 2019, pp. 4171–4186, Association for Computational Linguistics.
  • [2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
  • [3] Ernest Davis and Gary Marcus, “Commonsense reasoning and commonsense knowledge in artificial intelligence,” Commun. ACM, vol. 58, no. 9, pp. 92–103, 2015.
  • [4] Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang, “What does BERT learn from multiple-choice reading comprehension datasets?,” CoRR, vol. abs/1910.12391, 2019.
  • [5] An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li, “Enhancing pre-trained language representations with rich knowledge for machine reading comprehension,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2346–2357, Association for Computational Linguistics.
  • [6] Liang Wang, Meng Sun, Wei Zhao, Kewei Shen, and Jingming Liu, “Yuanfudao at SemEval-2018 task 11: Three-way attention and relational knowledge for commonsense machine comprehension,” in Proceedings of The 12th International Workshop on Semantic Evaluation, 2018, pp. 758–762.
  • [7] Chao Wang and Hui Jiang, “Explicit utilization of general knowledge in machine reading comprehension,” in Proceedings of the 57th ACL, Florence, Italy, 2019, pp. 2263–2272, Association for Computational Linguistics.
  • [8] Todor Mihaylov and Anette Frank, “Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 821–832, Association for Computational Linguistics.
  • [9] Lisa Bauer, Yicheng Wang, and Mohit Bansal, “Commonsense for generative multi-hop question answering tasks,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 4220–4230, Association for Computational Linguistics.
  • [10] Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren, “KagNet: Knowledge-aware graph networks for commonsense reasoning,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 2822–2832, Association for Computational Linguistics.
  • [11] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant, “CommonsenseQA: A question answering challenge targeting commonsense knowledge,” in NAACL-HLT, Minneapolis, Minnesota, 2019, pp. 4149–4158, Association for Computational Linguistics.
  • [12] Robyn Speer, Joshua Chin, and Catherine Havasi, “Conceptnet 5.5: An open multilingual graph of general knowledge,” in AAAI, 2017, pp. 4444–4451.
  • [13] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.
  • [14] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio, “Graph attention networks,” CoRR, vol. abs/1710.10903, 2017.
  • [15] Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu, “Commonsense knowledge aware conversation generation with graph attention,” in IJCAI, 2018, pp. 4623–4629.
  • [16] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014, pp. 1724–1734, Association for Computational Linguistics.
  • [17] Adam Paszke, Sam Gross, and Adam Lerer, “Automatic differentiation in pytorch,” 2017.
  • [18] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [19] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding by generative pre-training,” 2018.