
Learning to Recover Reasoning Chains for
Multi-Hop Question Answering via Cooperative Games

Yufei Feng   Mo Yu   Wenhan Xiong   Xiaoxiao Guo   Junjie Huang§
Shiyu Chang   Murray Campbell   Michael Greenspan   Xiaodan Zhu
Queen’s University   IBM Research  UC Santa Barbara  § Beihang University
Abstract

We propose the new problem of learning to recover reasoning chains from weakly supervised signals, i.e., the question-answer pairs. We propose a cooperative game approach to deal with this problem, in which how the evidence passages are selected and how the selected passages are connected are handled by two models that cooperate to select the most confident chains from a large set of candidates (from distant supervision). For evaluation, we created benchmarks based on two multi-hop QA datasets, HotpotQA and MedHop; and hand-labeled reasoning chains for the latter. The experimental results demonstrate the effectiveness of our proposed approach.

1 Introduction

NLP tasks that require multi-hop reasoning have recently enjoyed rapid progress, especially on multi-hop question answering Ding et al. (2019); Nie et al. (2019); Asai et al. (2019). Advances have benefited from rich annotations of supporting evidence, as in the popular multi-hop QA and relation extraction benchmarks, e.g., HotpotQA Yang et al. (2018) and DocRED Yao et al. (2019), where the evidence sentences for the reasoning process were labeled by human annotators.

Such evidence annotations are crucial for modern model training, since they provide finer-grained supervision that better guides model learning. Furthermore, they allow models to be trained in a pipeline fashion, with each step, such as passage ranking and answer extraction, trained as a supervised learning sub-task. This is crucial from a practical perspective for reducing memory usage when handling a large number of inputs with advanced, large pre-trained models Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019).

Manual evidence annotation is expensive, so there are only a few benchmarks with supporting evidence annotated. Even for these datasets, the structures of the annotations are still limited, as new model designs keep emerging and they may require different forms of evidence annotations. As a result, the supervision from these datasets can still be insufficient for training accurate models.

Figure 1: An example of reasoning chains in HotpotQA (2-hop) and MedHop (3-hop). HotpotQA provides only supporting passages $\{P_3, P_9\}$, without order and linking information.

Taking question answering with multi-hop reasoning as an example, annotating only supporting passages is not sufficient to show the reasoning process, due to the lack of necessary structural information (Figure 1). One example is the order of the annotated evidence, which is crucial in logical reasoning and whose importance has also been demonstrated in text-based QA Wang et al. (2019). The other example is how the annotated evidence pieces are connected, which requires at least the definition of arguments, such as a linking entity, concept, or event. Such information has proved useful in the recently popular entity-centric methods De Cao et al. (2019); Kundu et al. (2019); Xiao et al. (2019); Godbole et al. (2019); Ding et al. (2019); Asai et al. (2019), and intuitively would benefit these methods if available.

We propose a cooperative game approach to recovering the reasoning chains with the aforementioned necessary structural information for multi-hop QA. Each recovered chain corresponds to a list of ordered passages, and each pair of adjacent passages is connected by a linking entity. Specifically, we start with a model, the Ranker, which selects a sequence of passages arriving at the answer, with the restriction that each adjacent passage pair shares at least one entity. This is essentially an unsupervised task, and the selection suffers from noise and ambiguity. Therefore we introduce another model, the Reasoner, which predicts the exact linking entity that points to the next passage. The two models play a cooperative game and are rewarded when they find a consistent chain. In this way, we restrict the selection to satisfy not only the format constraints (i.e., ordered passages with connected adjacent pairs) but also the semantic constraint that the choice of the next passage given the partial selection can be effectively modeled by a Reasoner. Therefore, the selection can be less noisy.

We evaluate the proposed method on datasets with different properties, i.e., HotpotQA and MedHop Welbl et al. (2018), to cover cases with both 2-hop and 3-hop reasoning. We created labeled reasoning chains for both datasets (we will release our code and labeled evaluation data). Experimental results demonstrate the significant advantage of our proposed approach.

2 Task Definition

Reasoning Chains Examples of reasoning chains in HotpotQA and MedHop are shown in Figure 1. Formally, we aim at recovering the reasoning chain in the form of $(p_1 \rightarrow e_{1,2} \rightarrow p_2 \rightarrow e_{2,3} \rightarrow \cdots \rightarrow e_{n-1,n} \rightarrow p_n)$, where each $p_i$ is a passage and each $e_{i,i+1}$ is an entity that connects $p_i$ and $p_{i+1}$, i.e., appearing in both passages. The last passage $p_n$ in the chain contains the correct answer. We say $p_i$ connects $e_{i-1,i}$ and $e_{i,i+1}$ in the sense that it describes a relationship between the two entities.

Our Task Given a QA pair $(q, a)$ and all its candidate passages $\mathcal{P}$, we can extract all possible candidate chains that satisfy the conditions mentioned above, denoted as $\mathcal{C}$. The goal of reasoning chain recovery is to extract the correct chains from all the candidates, given $q$, $a$ and $\mathcal{P}$ as inputs.
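
To make the candidate construction concrete, the following is a minimal sketch (our own illustration, not the released code) of how 2-hop candidate chains in $\mathcal{C}$ could be enumerated, assuming passages are given as dicts with `id` and `text` fields and a precomputed `entities` map from passage id to its entity set:

```python
from itertools import permutations

def extract_2hop_candidates(answer, passages, entities):
    """Enumerate candidate chains (p_h -> e -> p_t) such that the tail passage
    contains the answer and head/tail share at least one entity.
    `entities` maps a passage id to its pre-extracted entity set (assumption)."""
    candidates = []
    for p_h, p_t in permutations(passages, 2):
        if answer.lower() not in p_t["text"].lower():
            continue  # the tail passage must contain the answer
        shared = entities[p_h["id"]] & entities[p_t["id"]]
        for e in shared:  # every shared entity yields one candidate chain
            candidates.append((p_h["id"], e, p_t["id"]))
    return candidates
```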

Related Work Although there has been recent interest in predicting reasoning chains for multi-hop QA Ding et al. (2019); Chen et al. (2019); Asai et al. (2019), these works all consider a fully supervised setting, i.e., annotated reasoning chains are available. Our work is the first to recover reasoning chains in a more general unsupervised setting, thus falling into the direction of denoising distantly supervised signals. From this perspective, the most relevant studies in the NLP field include Wang et al. (2018); Min et al. (2019) for evidence identification in open-domain QA and Lei et al. (2016); Perez et al. (2019); Yu et al. (2019) for rationale recovery.

3 Method

Figure 2: Model overview. The cooperative Ranker and Reasoner are trained alternately. The Ranker selects a passage $p$ at each step conditioned on the question $q$ and the history of selections, and receives reward $r_1$ if $p$ is evidence. Conditioned on $q$, the Reasoner predicts which entity from $p$ links to the next evidence passage. The Ranker receives extra reward $r_2$ if its next selection is connected by the entity predicted by the Reasoner. Both $q$ and the answer $a$ are model inputs. While $q$ is fed to the Ranker/Reasoner as input, empirically the best way of using $a$ is for constructing the candidate set and thus computing the reward $r_1$. We omit the flow from $q$/$a$ for simplicity.

The task of recovering reasoning chains is essentially an unsupervised problem, as we have no access to annotated reasoning chains. Therefore, we resort to the noisy training signal from chains obtained by distant supervision. We first propose a conditional selection model that optimizes the passage selection by considering their orders (Section 3.1). We then propose a cooperative Reasoner-Ranker game (Section 3.2) in which the Reasoner recovers the linking entities that point to the next passage. This enhancement encourages the Ranker to select the chains such that their distribution is easier for a linking entity prediction model (Reasoner) to capture. Therefore, it enables our model to denoise the supervision signals while recovering chains with entity information. Figure 2 gives our overall framework, with a flow describing how the Reasoner passes additional rewards to the Ranker.

3.1 Passage Ranking Model

The key component of our framework is the Ranker model, which is provided with a question $q$ and $K$ passages $\mathcal{P} = \{p_1, p_2, \ldots, p_K\}$ from a pool of candidates, and outputs a chain of selected passages.

Passage Scoring

For each step of the chain, the Ranker estimates a distribution over the selection of each passage. To this end, we first encode the question and passage with a 2-layer bi-directional GRU network, resulting in an encoded question $\bm{Q} = \{\vec{\bm{q}}_0, \vec{\bm{q}}_1, \ldots, \vec{\bm{q}}_N\}$ and $\bm{H}_i = \{\vec{\bm{h}}_{i,0}, \vec{\bm{h}}_{i,1}, \ldots, \vec{\bm{h}}_{i,M_i}\}$ for each passage $p_i \in \mathcal{P}$ of length $M_i$. Then we use the MatchLSTM model Wang and Jiang (2016) to get the matching score between $\bm{Q}$ and each $\bm{H}_i$ and derive the distribution of passage selection $P(p_i|q)$ (see Appendix A for details). We denote $P(p_i|q) = \textrm{MatchLSTM}(\bm{H}_i, \bm{Q})$ for simplicity.

Conditional Selection

To model passage dependency along the chain of reasoning, we use a hard selection model that builds a chain incrementally. Provided with the $K$ passages, at each step $t$ the Ranker computes $P^t(p_i|\bm{Q}^{t-1}),\ i = 0, \ldots, K$, the probability of selecting passage $p_i$ conditioned on the query and the previous state representation $\bm{Q}^{t-1}$. Then we sample one passage $p^t_\tau$ according to the predicted selection probability:

$$\begin{aligned}
p^t_\tau &= \textrm{Sampling}(\bm{P}^t) \\
\bm{Q}^t &= \textrm{FFN}([\bm{Q}^{t-1}, \tilde{\bm{m}}^t_{p_\tau}]) \\
\bm{P}^{t+1}(p_i|\bm{Q}^t) &= \textrm{MatchLSTM}(\bm{p}_i, \bm{Q}^t)
\end{aligned} \quad (1)$$

The first step starts with the original question $\bm{Q}^0$. A feed-forward network (FFN) is used to project the concatenation of the query encoding and the selected passage encoding $\tilde{\bm{m}}^t_{p_\tau}$ back to the query space, and the new query $\bm{Q}^t$ is used to select the next passage.
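
A minimal sketch of this conditional selection loop (our own simplification in which the query is a single vector; `match_lstm` and `ffn` are assumed callables returning per-passage scores/matching states and the query-space projection, respectively):

```python
import torch
import torch.nn.functional as F

def select_chain(q_state, passage_states, match_lstm, ffn, n_steps):
    """Sequentially sample a chain of passages, updating the query
    representation after each selection (a sketch of Eq. (1))."""
    chain, log_probs = [], []
    for _ in range(n_steps):
        scores, match_states = match_lstm(passage_states, q_state)
        probs = F.softmax(scores, dim=-1)          # P^t(p_i | Q^{t-1})
        idx = torch.multinomial(probs, 1).item()   # sample one passage p_tau^t
        chain.append(idx)
        log_probs.append(torch.log(probs[idx]))
        # project [previous query, selected passage matching state] back to
        # the query space to form the new query Q^t
        q_state = ffn(torch.cat([q_state, match_states[idx]], dim=-1))
    return chain, torch.stack(log_probs)
```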

Reward via Distant Supervision

We use policy gradient Williams (1992) to optimize our model. As we have no access to annotated reasoning chains during training, the reward comes from distant supervision. Specifically, we reward the Ranker if a selected passage appears as the corresponding part of a distantly supervised chain in $\mathcal{C}$. The model receives an immediate reward at each step of selection.
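
To make the training signal concrete, here is a minimal sketch of the resulting REINFORCE-style loss (our own illustration; it omits baselines and other variance-reduction tricks an actual implementation might use):

```python
def reinforce_loss(log_probs, rewards):
    """log_probs: log-probabilities of the sampled passage selections, one per
    step; rewards: the corresponding per-step rewards defined below.
    Minimizing this quantity maximizes the expected reward (REINFORCE)."""
    return -sum(lp * r for lp, r in zip(log_probs, rewards))
```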

In this paper we only consider chains consisting of $\leq 3$ passages (2-hop and 3-hop chains). (It has been shown that $\leq 3$ hops can cover most real-world cases, such as KB reasoning Xiong et al. (2017); Das et al. (2018).) For the 2-hop cases, our model predicts a chain of two passages from the candidate set $\mathcal{C}$ in the form of $p_h \rightarrow e \rightarrow p_t$. Each candidate chain satisfies that $p_t$ contains the answer, while $p_h$ and $p_t$ contain a shared entity $e$. We call $p_h$ the head passage and $p_t$ the tail passage. Let $\mathcal{P}_T$/$\mathcal{P}_H$ denote the set of all tail/head passages from $\mathcal{C}$. Our model receives rewards $r_h, r_t$ according to its selections:

$$r_t = 1.0 \iff p_t \in \mathcal{P}_T, \quad r_h = 1.0 \iff p_h \in \mathcal{P}_H \quad (2)$$

For the 3-hop cases, we need to select an additional intermediate passage $p_m$ between $p_h$ and $p_t$. If we rewarded any $p_m$ selection that appears in the middle of a chain in the candidate chain set $\mathcal{C}$, the number of feasible options could be very large. Therefore, we make our model first select the head passage $p_h$ and the tail passage $p_t$ independently and then select $p_m$ conditioned on $(p_h, p_t)$. We further restrict that each path in $\mathcal{C}$ must have its head passage containing an entity from $q$. Then the selected $p_m$ is only rewarded if it appears in a chain in $\mathcal{C}$ that starts with $p_h$ and ends with $p_t$:

$$\begin{aligned}
r_h &= 1.0 \iff p_h \in \mathcal{P}_H, \quad r_t = 1.0 \iff p_t \in \mathcal{P}_T \\
r_m &= 1.0 \iff \text{path } (p_h, p_m, p_t) \in \mathcal{C}
\end{aligned} \quad (3)$$
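
For concreteness, the reward checks of Eqs. (2) and (3) reduce to membership tests against the distant-supervision set $\mathcal{C}$; the following is our own sketch with a hypothetical representation of $\mathcal{C}$ as tuples of passage ids (linking entities omitted):

```python
def chain_rewards(p_h, p_t, candidate_chains, p_m=None):
    """Sketch of Eqs. (2) and (3). candidate_chains is the distant-supervision
    set C, stored here as tuples of passage ids (an assumption for illustration)."""
    heads = {c[0] for c in candidate_chains}
    tails = {c[-1] for c in candidate_chains}
    r_h = 1.0 if p_h in heads else 0.0
    r_t = 1.0 if p_t in tails else 0.0
    if p_m is None:                                   # 2-hop case, Eq. (2)
        return r_h, r_t
    # 3-hop case, Eq. (3): the middle passage is rewarded only if the full
    # path (p_h, p_m, p_t) appears in C
    r_m = 1.0 if (p_h, p_m, p_t) in candidate_chains else 0.0
    return r_h, r_m, r_t
```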

3.2 Cooperative Reasoner

To alleviate the noise in the distant supervision signal $\mathcal{C}$, in addition to the conditional selection, we further propose a cooperative Reasoner model, also implemented with the MatchLSTM architecture (see Appendix A), to predict the linking entity from the selected passages. Intuitively, when the Ranker makes more accurate passage selections, the Reasoner works with less noisy data and thus succeeds more easily. Specifically, the Reasoner learns to extract the linking entity from chains selected by a well-trained Ranker, and it benefits the Ranker training by providing extra rewards. Taking the 2-hop case as an example, we train the Ranker and Reasoner alternately in a cooperative game:

Reasoner Step: Given the first passage $p_t$ selected by the trained Ranker (the same method holds for selecting $p_h$ first; Section 4 shows that starting from the answer is empirically better), the Reasoner predicts the probability of each entity $e$ appearing in $p_t$. The Reasoner is trained with the cross-entropy loss:

$$\begin{aligned}
\bm{P}(e|p_t, q) &= \textrm{MatchLSTM\_Reader}(\bm{H}_{p_t}, \bm{q}) \\
y_e &= \begin{cases} 1, & \text{if } e \in p_h \\ 0, & \text{otherwise} \end{cases}
\end{aligned} \quad (4)$$

Ranker Step: Given the Reasoner’s top-1 predicted linking entity $e$, the reward for the Ranker at the $2^{\textrm{nd}}$ step is defined as:

$$r_h = \begin{cases} 1, & \text{if } p_h \in \mathcal{P}_H \\ 1 + r, & \text{if } e \in p_h,\ p_h \in \mathcal{P}_H \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

The extension to 3-hop cases is straightforward; the only difference is that the Reasoner reads both the selected $p_h$ and $p_t$ to output two entities. The Ranker receives one extra reward if the Reasoner picks the correct linking entity from $p_h$, and likewise for $p_t$.
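
The overall alternating procedure can be summarized by the following schematic (a sketch only; `ranker.select`, `train_reasoner_step`, and `train_ranker_step` are placeholder names, not the released code):

```python
def cooperative_game(ranker, reasoner, train_data, n_rounds,
                     train_reasoner_step, train_ranker_step):
    """Schematic of the alternating training described above."""
    for _ in range(n_rounds):
        # Reasoner step: fit the linking-entity predictor on chains selected
        # by the current Ranker (cross-entropy loss, Eq. (4)).
        selected_chains = [ranker.select(example) for example in train_data]
        train_reasoner_step(reasoner, selected_chains)
        # Ranker step: re-train the Ranker with REINFORCE, where the reward
        # includes the extra bonus of Eq. (5) from the Reasoner's top-1 entity.
        train_ranker_step(ranker, train_data, reasoner)
    return ranker, reasoner
```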

4 Experiments

4.1 Settings

Datasets

We evaluate our path selection model on HotpotQA bridge type questions and on the MedHop dataset. In HotpotQA, the entities are pre-processed Wiki anchor link objects and in MedHop they are drug/protein database identifiers.

For HotpotQA, two supporting passages are provided along with each question. We ignore the support annotations during training and use them to create the ground truth on the development set: following Wang et al. (2019), we determine the order of passages according to whether a passage contains the answer. We discard ambiguous instances.

For MedHop, there is no evidence annotated. Therefore we created a new evaluation dataset by manually annotating the correct paths for part of the development set: we first extract all candidate paths in the form of passage triplets $(p_h, p_m, p_t)$, such that $p_h$ contains the query drug and $p_t$ contains the answer drug, and $p_h$/$p_m$ and $p_m$/$p_t$ are connected by shared proteins. We label a chain as positive if all the drug-protein or protein-protein interactions are described in the corresponding passages. Note that the positive paths are not unique for a question.

During training we select chains based on the full passage set $\mathcal{P}$; at inference time we extract the chains from the candidate set $\mathcal{C}$ (see Section 2).

Baselines and Evaluation Metric

We compare our model with (1) a random baseline, which randomly selects a candidate chain from the distant supervision chain set $\mathcal{C}$; and (2) a distant supervised MatchLSTM, which uses the same base model as ours but scores and selects the passages independently. We use accuracy as our evaluation metric. As HotpotQA does not provide ground-truth linking entities, we only evaluate whether the supporting passages are fully recovered (yet our model still outputs the full chains). For MedHop we evaluate whether the whole predicted chain is correct. More details can be found in Appendix B. We use GloVe Pennington et al. (2014) as word embeddings for HotpotQA, and BioWordVec Zhang et al. (2019) for MedHop.

4.2 Results

HotpotQA

We first evaluate on the 2-hop HotpotQA task. Our best-performing model first selects the tail passage $p_t$ and then the head passage $p_h$, because the number of tail candidates is smaller ($\sim$2 per question). Table 1 shows the results. First, training a Ranker with distant supervision performs significantly better than the random baseline, showing that the training process itself has a certain degree of denoising ability to distinguish the more informative signals from distant supervision labels. By introducing the additional inductive bias of ordering, the conditional selection model further improves by a large margin. Finally, our cooperative game gives the best performance, showing that a trained Reasoner has the ability to ignore entity links that are irrelevant to the reasoning chain.

Table 2 demonstrates the effect of the selection direction, together with the methods’ recall on head passages and tail passages. The latter is evaluated on the subset of bridge-type questions in HotpotQA that have no ambiguity in the passage order of their support annotations; i.e., among the two human-labeled supporting passages, only one contains the answer and thus must be the tail. The results show that selecting the tail first performs better. The cooperative game mainly improves the head selection.

Model                          HotpotQA   MedHop
Random                         40.3%      56.0%
Distant Supervised MatchLSTM   74.0%      59.3%
Conditional Selection          84.7%      59.3%
Cooperative Game               87.2%      62.6%

Table 1: Reasoning chain selection results.
Model                                  Head/Tail     EM
Conditional Selection (Head to Tail)   80.7/95.0%    77.1%
Conditional Selection (Tail to Head)   88.1/96.2%    84.7%
  + Cooperative Reasoner               90.1/96.7%    87.2%

Table 2: Ablation test on HotpotQA.

MedHop

Results in Table 1 show that recovering chains from MedHop is a much harder task: first, the large number of distant supervision chains in $\mathcal{C}$ introduces too much noise, so the Distant Supervised Ranker improves by only 3%; second, the dependent model leads to no improvement because $\mathcal{C}$ is strictly ordered given our data construction. Our cooperative game manages to remain effective and gives further improvement.

5 Conclusions

In this paper we propose the problem of recovering reasoning chains in multi-hop QA from weak supervision signals. Our model adopts a cooperative game approach in which a Ranker and a Reasoner cooperate to select the most confident chains. Experiments on the HotpotQA and MedHop benchmarks show the effectiveness of the proposed approach.

References

  • Asai et al. (2019) Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019. Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470.
  • Chen et al. (2019) Jifan Chen, Shih-ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610.
  • Das et al. (2018) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In Proceedings of ICLR 2018.
  • De Cao et al. (2019) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of NAACL-HLT 2019.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.
  • Ding et al. (2019) Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of ACL 2019.
  • Godbole et al. (2019) Ameya Godbole, Dilip Kavarthapu, Rajarshi Das, Zhiyu Gong, Abhishek Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. arXiv preprint arXiv:1909.07598.
  • Kundu et al. (2019) Souvik Kundu, Tushar Khot, and Ashish Sabharwal. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of ACL 2019.
  • Lei et al. (2016) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.
  • Min et al. (2019) Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM approach for weakly supervised question answering. In Proceedings of EMNLP 2019.
  • Nie et al. (2019) Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of EMNLP 2019.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Perez et al. (2019) Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. 2019. Finding generalizable evidence by learning to convince Q&A models. In Proceedings of EMNLP 2019.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, Technical report, OpenAI.
  • Wang et al. (2019) Haoyu Wang, Mo Yu, Xiaoxiao Guo, Rajarshi Das, Wenhan Xiong, and Tian Gao. 2019. Do multi-hop readers dream of reasoning chains? arXiv preprint arXiv:1910.14520.
  • Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL-HLT 2016, pages 1442–1451.
  • Wang et al. (2018) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI 2018.
  • Welbl et al. (2018) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
  • Xiao et al. (2019) Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of ACL 2019.
  • Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. DeepPath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of EMNLP 2017.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP 2018.
  • Yao et al. (2019) Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL 2019.
  • Yu et al. (2019) Mo Yu, Shiyu Chang, Yang Zhang, and Tommi S Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. arXiv preprint arXiv:1910.13294.
  • Zhang et al. (2019) Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. 2019. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6(1):52.

Appendix A Details of MatchLSTMs for Passage Scoring and Reasoner

MatchLSTM for Passage Scoring

Given the embeddings $\bm{Q} = \{\vec{\bm{q}}_0, \vec{\bm{q}}_1, \ldots, \vec{\bm{q}}_N\}$ of the question $q$, and $\bm{H}_i = \{\vec{\bm{h}}_{i,0}, \vec{\bm{h}}_{i,1}, \ldots, \vec{\bm{h}}_{i,M_i}\}$ of each passage $p_i \in \mathcal{P}$, we use the MatchLSTM Wang and Jiang (2016) to match $\bm{Q}$ and $\bm{H}_i$ as follows:

$$\begin{aligned}
e_{jk} &= \vec{\bm{q}}_j^{\,T} \vec{\bm{h}}_{i,k} \\
\tilde{\bm{q}}_j &= \sum_{k=0}^{M} \frac{\exp(e_{jk})}{\sum_{l=0}^{M} \exp(e_{jl})} \vec{\bm{h}}_{i,k} \\
\tilde{\bm{m}}_{i,j} &= [\bm{q}_j, \tilde{\bm{q}}_j, \bm{q}_j - \tilde{\bm{q}}_j, \bm{q}_j * \tilde{\bm{q}}_j] \\
\tilde{\bm{m}}_i &= \textrm{MaxPool}(\textrm{GRU}[\tilde{\bm{m}}_{i,0}, \tilde{\bm{m}}_{i,1}, \ldots, \tilde{\bm{m}}_{i,N}])
\end{aligned} \quad (6)$$

The final vector $\tilde{\bm{m}}_i$ represents the matching state between $q$ and $p_i$. All the $\tilde{\bm{m}}_i$s are then passed to a linear layer that outputs the ranking score of each passage. We apply softmax over the scores to get the probability of passage selection $P(p_i|q)$. We denote the above computation as $P(p_i|q) = \textrm{MatchLSTM}(\bm{H}_i, \bm{Q})$ for simplicity.
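
For illustration, a minimal PyTorch sketch of this scoring computation (assuming the question and passages have already been encoded into token vectors of dimension `dim`; this is our own reading of Eq. (6), not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchLSTMScorer(nn.Module):
    """Sketch of the passage-scoring module in Eq. (6): question-to-passage
    attention, comparison features, a GRU over the matched sequence,
    max-pooling, and a linear ranking score followed by softmax."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(4 * dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, Q, H_list):
        # Q: (N, dim) encoded question; H_list: list of (M_i, dim) passages
        scores = []
        for H in H_list:
            e = Q @ H.t()                          # (N, M_i) similarities e_{jk}
            q_tilde = F.softmax(e, dim=1) @ H      # (N, dim) attended passage
            m = torch.cat([Q, q_tilde, Q - q_tilde, Q * q_tilde], dim=-1)
            out, _ = self.gru(m.unsqueeze(0))      # (1, N, 2*dim)
            pooled = out.max(dim=1).values         # max-pool over tokens
            scores.append(self.score(pooled).squeeze())
        return F.softmax(torch.stack(scores), dim=0)  # P(p_i | q)
```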

MatchLSTM for Reasoner

Given the question embedding $\bm{Q}^r = \{\vec{\bm{q}}^r_0, \vec{\bm{q}}^r_1, \ldots, \vec{\bm{q}}^r_N\}$ and the input passage embedding $\bm{H}^r = \{\vec{\bm{h}}^r_0, \vec{\bm{h}}^r_1, \ldots, \vec{\bm{h}}^r_M\}$ of $p$, the Reasoner predicts the probability of each entity in the passage being the linking entity of the next passage in the chain. We use a reader model similar to Yang et al. (2018) as our Reasoner network.

We first describe an attention sub-module. Given input sequence embeddings $\bm{A} = \{\vec{\bm{a}}_0, \vec{\bm{a}}_1, \ldots, \vec{\bm{a}}_N\}$ and $\bm{B} = \{\vec{\bm{b}}_0, \vec{\bm{b}}_1, \ldots, \vec{\bm{b}}_M\}$, we define $\tilde{\mathcal{M}} = \text{Attention}(\bm{A}, \bm{B})$:

$$\begin{aligned}
e_{jk} &= \vec{\bm{a}}_j^{\,T} \vec{\bm{b}}_k \\
\tilde{\bm{b}}_k &= \sum_{j=0}^{N} \frac{\exp(e_{jk})}{\sum_{l=0}^{N} \exp(e_{lk})} \vec{\bm{a}}_j \\
\tilde{\bm{m}}_k &= \textrm{FFN}([\bm{b}_k, \tilde{\bm{b}}_k, \bm{b}_k - \tilde{\bm{b}}_k, \bm{b}_k * \tilde{\bm{b}}_k]) \\
\tilde{\mathcal{M}} &= [\tilde{\bm{m}}_0, \tilde{\bm{m}}_1, \ldots, \tilde{\bm{m}}_M],
\end{aligned} \quad (7)$$

where FFN denotes a feed-forward layer which projects the concatenated embedding back to the original space.
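
A minimal PyTorch sketch of this sub-module, under the same assumptions as the passage-scoring sketch above (our own illustration, not the released code):

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of the attention sub-module in Eq. (7): each token of B attends
    over the tokens of A, and a feed-forward layer projects the comparison
    features back to the original dimension."""
    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Linear(4 * dim, dim)

    def forward(self, A, B):
        # A: (N, dim), B: (M, dim)
        e = A @ B.t()                      # (N, M), e_{jk} = a_j . b_k
        attn = torch.softmax(e, dim=0)     # normalize over the tokens of A (index j)
        B_tilde = attn.t() @ A             # (M, dim), b~_k = sum_j w_jk a_j
        feats = torch.cat([B, B_tilde, B - B_tilde, B * B_tilde], dim=-1)
        return self.ffn(feats)             # (M, dim)
```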

The Reasoner network consists of multiple attention layers, together with a bidirectional GRU encoder and skip connection.

$$\begin{aligned}
\tilde{\mathcal{M}}^r_1 &= \text{Attention}(\bm{Q}^r, \bm{H}^r) \\
\tilde{\bm{H}}^r_1 &= \text{Bi-GRU}(\tilde{\mathcal{M}}^r_1) \\
\tilde{\mathcal{M}}^r_2 &= \text{Attention}(\bm{H}^r_1, \bm{H}^r_1) \\
\tilde{\bm{H}}^r_p = [\bm{h}^r_{p,0}, \bm{h}^r_{p,1}, \ldots, \bm{h}^r_{p,M}] &= \text{Bi-GRU}(\tilde{\mathcal{M}}^r_1 + \tilde{\mathcal{M}}^r_2)
\end{aligned} \quad (8)$$

For each token $e_k$, $k = 0, 1, \ldots, M$, represented by $\bm{h}^r_{p,k}$ at the corresponding location, we have:

$$P^r(e_k|\bm{p}) = \begin{cases} g(\bm{h}^r_{p,k}), & \text{if } e_k \text{ is a named entity} \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

where $g$ is the classification layer; softmax is applied across all entities to get the probabilities. We denote the computation above as $P^r(e_k|\bm{p}) = \textrm{MatchLSTM\_Reader}(e_k, \bm{p})$ for simplicity.
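
The masking and normalization of Eq. (9) can be sketched as follows (hypothetical tensors; `token_logits` stands for the outputs of the classification layer $g$):

```python
import torch

def entity_distribution(token_logits, entity_mask):
    """Sketch of Eq. (9): token_logits has shape (M,); entity_mask is 1 for
    tokens that are named entities and 0 otherwise (both assumptions)."""
    masked = token_logits.masked_fill(entity_mask == 0, float("-inf"))
    return torch.softmax(masked, dim=-1)  # probability over entity tokens only
```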

Appendix B Definition of Chain Accuracy

In HotpotQA, on average we can find 6 candidate chains (2-hop) in an instance, and the human-labeled true reasoning chain is unique. A predicted chain is correct if it contains all and only the supporting passages (exact match of passages).

In MedHop, on average we can find 30 candidate chains (3-hop). For each candidate chain, our human annotators labeled whether it is correct or not, and the correct reasoning chain is not unique. A predicted chain is correct if it is one of the chains labeled as correct by the annotators.

The accuracy is defined as the ratio:

$$acc = \frac{\text{\# of correct chains predicted}}{\text{\# of evaluation samples}} \quad (10)$$
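
Equivalently, as a small sketch (assuming each predicted chain is represented as a tuple of passage ids and each evaluation sample comes with the set of chains labeled correct):

```python
def chain_accuracy(predictions, gold_chains):
    """predictions: one predicted chain per evaluation sample.
    gold_chains: for each sample, the set of chains labeled correct
    (a single chain for HotpotQA, possibly several for MedHop)."""
    correct = sum(1 for pred, gold in zip(predictions, gold_chains)
                  if pred in gold)
    return correct / len(predictions)
```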