Learning to Recover Reasoning Chains for
Multi-Hop Question Answering via Cooperative Games
Abstract
We propose the new problem of learning to recover reasoning chains from weakly supervised signals, i.e., question-answer pairs. We address it with a cooperative game approach, in which two models, one selecting the evidence passages and one determining how the selected passages are connected, cooperate to select the most confident chains from a large set of candidates obtained by distant supervision. For evaluation, we created benchmarks based on two multi-hop QA datasets, HotpotQA and MedHop, and hand-labeled reasoning chains for the latter. The experimental results demonstrate the effectiveness of our proposed approach.
1 Introduction
NLP tasks that require multi-hop reasoning have recently enjoyed rapid progress, especially on multi-hop question answering Ding et al. (2019); Nie et al. (2019); Asai et al. (2019). Advances have benefited from rich annotations of supporting evidence, as in the popular multi-hop QA and relation extraction benchmarks, e.g., HotpotQA Yang et al. (2018) and DocRED Yao et al. (2019), where the evidence sentences for the reasoning process were labeled by human annotators.
Such evidence annotations are crucial for modern model training, since they provide finer-grained supervision that better guides model learning. Furthermore, they allow training in a pipeline fashion, where each step, such as passage ranking and answer extraction, is trained as a supervised learning sub-task. This is crucial from a practical perspective for reducing memory usage when handling large amounts of input with advanced, large pre-trained models Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019).
Manual evidence annotation is expensive, so there are only a few benchmarks with supporting evidence annotated. Even for these datasets, the structures of the annotations are still limited, as new model designs keep emerging and they may require different forms of evidence annotations. As a result, the supervision from these datasets can still be insufficient for training accurate models.

Taking question answering with multi-hop reasoning as an example, annotating only supporting passages is not sufficient to show the reasoning process, due to the lack of necessary structural information (Figure 1). One example is the order of annotated evidence, which is crucial in logical reasoning and whose importance has also been demonstrated in text-based QA Wang et al. (2019). Another example is how the annotated evidence pieces are connected, which requires at least the definition of arguments, such as a linking entity, concept, or event. Such information has proved useful in the recently popular entity-centric methods De Cao et al. (2019); Kundu et al. (2019); Xiao et al. (2019); Godbole et al. (2019); Ding et al. (2019); Asai et al. (2019), and would intuitively benefit these methods further if made available.
We propose a cooperative game approach to recovering the reasoning chains with the aforementioned necessary structural information for multi-hop QA. Each recovered chain corresponds to a list of ordered passages, and each pair of adjacent passages is connected by a linking entity. Specifically, we start with a model, the Ranker, which selects a sequence of passages arriving at the answer, with the restriction that each adjacent passage pair shares at least one entity. This is essentially an unsupervised task, and the selection suffers from noise and ambiguity. Therefore we introduce another model, the Reasoner, which predicts the exact linking entity that points to the next passage. The two models play a cooperative game and are rewarded when they find a consistent chain. In this way, we restrict the selection to satisfy not only the format constraints (i.e., ordered passages with connected adjacent pairs) but also a semantic constraint (i.e., the transition from the partial selection to the next passage can be effectively modeled by the Reasoner). Therefore, the selection becomes less noisy.
We evaluate the proposed method on datasets with different properties, i.e., HotpotQA and MedHop Welbl et al. (2018), to cover cases with both 2-hop and 3-hop reasoning. We created labeled reasoning chains for both datasets. (We will release our code and labeled evaluation data.) Experimental results demonstrate the significant advantage of our proposed approach.
2 Task Definition
Reasoning Chains Examples of reasoning chains in HotpotQA and MedHop are shown in Figure 1. Formally, we aim at recovering the reasoning chain in the form of $p_1 \rightarrow e_{1,2} \rightarrow p_2 \rightarrow e_{2,3} \rightarrow \cdots \rightarrow e_{n-1,n} \rightarrow p_n$, where each $p_i$ is a passage and each $e_{i,i+1}$ is an entity that connects $p_i$ and $p_{i+1}$, i.e., appears in both passages. The last passage $p_n$ in the chain contains the correct answer. We say $p_i$ connects $e_{i-1,i}$ and $e_{i,i+1}$ in the sense that it describes a relationship between the two entities.
Our Task Given a QA pair $(q, a)$ and all of its candidate passages $\mathcal{P}$, we can extract all possible candidate chains that satisfy the conditions mentioned above, denoted as $\mathcal{C}$. The goal of reasoning chain recovery is to extract the correct chains from all the candidates, given $(q, a)$ and $\mathcal{P}$ as inputs.
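For concreteness, the following is a minimal sketch of how the 2-hop candidate set $\mathcal{C}$ could be enumerated; it is our illustration, not part of the paper, and the data structures (passage texts and pre-extracted entity sets, e.g., Wiki anchor links for HotpotQA or protein identifiers for MedHop) are assumptions.

```python
from itertools import permutations

def candidate_chains_2hop(answer, passages, passage_entities):
    """Enumerate 2-hop candidate chains (p_h -> e -> p_t) for one question.

    passages:         dict passage_id -> passage text
    passage_entities: dict passage_id -> set of entity strings in that passage
    A candidate qualifies if the tail passage contains the answer and the two
    passages share at least one entity (the candidate linking entity e).
    """
    chains = []
    for p_h, p_t in permutations(passages, 2):
        if answer not in passages[p_t]:
            continue  # the last passage in a chain must contain the answer
        for e in passage_entities[p_h] & passage_entities[p_t]:
            chains.append((p_h, e, p_t))
    return chains
```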
Related Work Although there is recent interest in predicting reasoning chains for multi-hop QA Ding et al. (2019); Chen et al. (2019); Asai et al. (2019), these works all consider a fully supervised setting, i.e., annotated reasoning chains are available. Our work is the first to recover reasoning chains in a more general unsupervised setting, thus falling into the direction of denoising distant supervision signals. From this perspective, the most relevant studies in NLP include Wang et al. (2018); Min et al. (2019) for evidence identification in open-domain QA and Lei et al. (2016); Perez et al. (2019); Yu et al. (2019) for rationale recovery.
3 Method

The task of recovering reasoning chains is essentially an unsupervised problem, as we have no access to annotated reasoning chains. Therefore, we resort to the noisy training signal from chains obtained by distant supervision. We first propose a conditional selection model that optimizes the passage selection by considering their orders (Section 3.1). We then propose a cooperative Reasoner-Ranker game (Section 3.2) in which the Reasoner recovers the linking entities that point to the next passage. This enhancement encourages the Ranker to select the chains such that their distribution is easier for a linking entity prediction model (Reasoner) to capture. Therefore, it enables our model to denoise the supervision signals while recovering chains with entity information. Figure 2 gives our overall framework, with a flow describing how the Reasoner passes additional rewards to the Ranker.
3.1 Passage Ranking Model
The key component of our framework is the Ranker model, which is provided with a question and passages from a pool of candidates, and outputs a chain of selected passages.
Passage Scoring
For each step of the chain, the Ranker estimates a distribution over the selection of each passage. To this end, we first encode the question and each passage with a 2-layer bi-directional GRU network, resulting in an encoded question $\mathbf{H}^{q}$ and an encoding $\mathbf{H}^{p_i}$ for each passage $p_i$ of length $l_i$. Then we use the MatchLSTM model Wang and Jiang (2016) to get the matching score between $\mathbf{H}^{q}$ and each $\mathbf{H}^{p_i}$ and derive the distribution of passage selection $P(p_i \mid q)$ (see Appendix A for details). We denote $P(p_i \mid q) = \mathrm{MatchLSTM}(\mathbf{H}^{p_i}, \mathbf{H}^{q})$ for simplicity.
Conditional Selection
To model passage dependency along the chain of reasoning, we use a hard selection model that builds a chain incrementally. At each step $t$, the Ranker computes $P(p_i \mid \tilde{q}^{t})$, the probability of selecting passage $p_i$ conditioned on the query and the previous selections, summarized in the state representation $\tilde{q}^{t}$. Then we sample one passage $p^{t}$ according to the predicted selection probability:

$$P(p_i \mid \tilde{q}^{t}) = \mathrm{MatchLSTM}(\mathbf{H}^{p_i}, \tilde{q}^{t}), \qquad \tilde{q}^{t+1} = \mathrm{FFN}\big([\tilde{q}^{t}; \mathbf{m}^{p^{t}}]\big) \tag{1}$$

The first step starts with the original question, i.e., $\tilde{q}^{1} = \mathbf{H}^{q}$. A feed-forward network (FFN) is used to project the concatenation of the query encoding and the selected passage's matching state $\mathbf{m}^{p^{t}}$ (Appendix A) back to the query space, and the new query $\tilde{q}^{t+1}$ is used to select the next passage.
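As an illustration of the conditional selection in Eq. (1), here is a minimal PyTorch-style sketch; it is ours, not the authors' implementation, and `score_fn` (standing in for the MatchLSTM scorer), `query_update` (standing in for the FFN projection), and the pooled-encoding shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def select_chain(q_enc, passage_encs, score_fn, query_update, n_hops):
    """Sample a chain of passages step by step, conditioning each selection
    on a query updated with the previously chosen passage (cf. Eq. 1).

    q_enc:        (d,) tensor, pooled question encoding
    passage_encs: (num_passages, d) tensor, pooled passage encodings
    score_fn:     callable (query, passage_encs) -> (num_passages,) match scores
    query_update: callable (query, passage_enc) -> new query of size (d,)
    """
    query, chain, log_probs = q_enc, [], []
    for _ in range(n_hops):
        probs = F.softmax(score_fn(query, passage_encs), dim=-1)
        idx = torch.multinomial(probs, 1).item()          # hard (sampled) selection
        chain.append(idx)
        log_probs.append(torch.log(probs[idx]))           # kept for policy gradient
        query = query_update(query, passage_encs[idx])    # project back to query space
    return chain, torch.stack(log_probs)
```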
Reward via Distant Supervision
We use policy gradient Williams (1992) to optimize our model. As we have no access to annotated reasoning chains during training, the reward comes from distant supervision. Specifically, we reward the Ranker if a selected passage appears in the corresponding position of a distant-supervised chain in $\mathcal{C}$. The model receives an immediate reward at each step of selection.
In this paper we only consider chains consisting of two or three passages (2-hop and 3-hop chains). (It has been shown that chains of such lengths cover most real-world cases, e.g., in KB reasoning Xiong et al. (2017); Das et al. (2018).) For the 2-hop cases, our model predicts a chain of two passages from the candidate set $\mathcal{P}$ in the form of $p_h \rightarrow e \rightarrow p_t$. Each candidate chain satisfies that $p_t$ contains the answer, while $p_h$ and $p_t$ contain a shared entity $e$. We call $p_h$ the head passage and $p_t$ the tail passage. Let $\mathcal{P}_T / \mathcal{P}_H$ denote the set of all tail/head passages from $\mathcal{C}$. Our model receives rewards according to its selections:

$$r_h = \mathbb{1}\big[\hat{p}_h \in \mathcal{P}_H\big], \qquad r_t = \mathbb{1}\big[\hat{p}_t \in \mathcal{P}_T\big] \tag{2}$$
For the 3-hop cases, we need to select an additional intermediate passage $p_m$ between $p_h$ and $p_t$. If we reward any selection of $p_m$ that appears in the middle of a chain in the candidate chain set $\mathcal{C}$, the number of feasible options can be very large. Therefore, we make our model first select the head passage $p_h$ and the tail passage $p_t$ independently, and then select $p_m$ conditioned on $(p_h, p_t)$. We further restrict that each path in $\mathcal{C}$ must have its head passage containing an entity from $q$. The selected $p_m$ is then only rewarded if it appears in a chain in $\mathcal{C}$ that starts with $p_h$ and ends with $p_t$:

$$r_m = \mathbb{1}\big[\exists\, (p_h \rightarrow p_m \rightarrow p_t) \in \mathcal{C}\big] \tag{3}$$
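The rewards in Eqs. (2) and (3) can be computed directly from the candidate set. The sketch below is our illustration, with candidate chains represented as hypothetical tuples: $(p_h, e, p_t)$ triples for 2-hop and $(p_h, p_m, p_t)$ passage triples for 3-hop.

```python
def distant_rewards_2hop(selected_head, selected_tail, candidate_chains):
    """Reward each selection that matches the corresponding slot of some
    distant-supervision chain (p_h, e, p_t) in the candidate set (Eq. 2)."""
    heads = {c[0] for c in candidate_chains}
    tails = {c[-1] for c in candidate_chains}
    return float(selected_head in heads), float(selected_tail in tails)

def distant_reward_3hop_middle(selected_head, selected_mid, selected_tail,
                               candidate_chains):
    """The middle passage is rewarded only if it lies on a candidate chain
    (p_h, p_m, p_t) that starts with the selected head and ends with the
    selected tail (Eq. 3)."""
    return float(any(c[0] == selected_head and c[1] == selected_mid
                     and c[2] == selected_tail for c in candidate_chains))
```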
3.2 Cooperative Reasoner
To alleviate the noise in the distant supervision signal $\mathcal{C}$, in addition to the conditional selection, we further propose a cooperative Reasoner model, also implemented with the MatchLSTM architecture (see Appendix A), to predict the linking entity from the selected passages. Intuitively, when the Ranker makes more accurate passage selections, the Reasoner works with less noisy data and thus succeeds more easily. Specifically, the Reasoner learns to extract the linking entity from chains selected by a well-trained Ranker, and it benefits the Ranker training by providing extra rewards. Taking the 2-hop case as an example, we train the Ranker and the Reasoner alternately as a cooperative game:
Reasoner Step: Given the first passage $p_t$ selected by the trained Ranker (the same method holds for selecting $p_h$ first; Section 4 shows that starting from the answer is empirically better), the Reasoner predicts the probability of each entity $e$ appearing in $p_t$ being the linking entity. The Reasoner is trained with the cross-entropy loss:

$$P(e \mid p_t) = \mathrm{Reasoner}(\mathbf{H}^{q}, \mathbf{H}^{p_t}), \qquad \mathcal{L} = -\sum_{e} y_e \log P(e \mid p_t) \tag{4}$$

where $y_e$ indicates whether $e$ is a linking entity of a distant-supervision chain containing $p_t$.
Ranker Step: Given the Reasoner's top-1 predicted linking entity $\hat{e}$, the reward for the Ranker at the next step is defined as:

$$r' = r + \mathbb{1}\big[\hat{e} \in \hat{p}^{\,2}\big] \tag{5}$$

where $\hat{p}^{\,2}$ is the passage selected at that step and $r$ is the distant-supervision reward from Eq. (2).
The extension to the 3-hop case is straightforward; the only difference is that the Reasoner reads both the selected $p_h$ and $p_t$ and outputs two entities. The Ranker receives one extra reward if the Reasoner picks the correct linking entity from $p_h$, and likewise for $p_t$.
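To make the alternation concrete, the following is a high-level Python sketch of one training iteration of the cooperative game. It is our illustration only: the `ranker`/`reasoner` wrappers, their method names, and the reward helper are hypothetical stand-ins for the components described above.

```python
import torch

def cooperative_training_step(ranker, reasoner, batch, candidate_chains):
    """One alternating update of the Ranker-Reasoner game (2-hop case)."""
    # Reasoner step: train the Reasoner to predict the linking entity on the
    # chain currently selected by the (fixed) Ranker, with cross-entropy loss.
    chain = ranker.select_chain(batch)                 # e.g. (p_t, p_h), tail first
    reasoner_loss = reasoner.linking_entity_loss(batch, chain)
    reasoner_loss.backward()
    reasoner.optimizer_step()

    # Ranker step: REINFORCE with the distant-supervision rewards (Eq. 2),
    # plus one extra reward when the Reasoner's top-1 entity appears in the
    # passage selected at the next step (Eq. 5).
    chain, log_probs = ranker.sample_chain(batch)      # log-probs from Eq. (1)
    rewards = ranker.step_rewards(chain, candidate_chains)
    rewards[-1] += float(reasoner.top1_entity(batch, chain[0]) in chain[-1])
    ranker_loss = -(log_probs * torch.tensor(rewards)).sum()
    ranker_loss.backward()
    ranker.optimizer_step()
```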
4 Experiments
4.1 Settings
Datasets
We evaluate our path selection model on HotpotQA bridge type questions and on the MedHop dataset. In HotpotQA, the entities are pre-processed Wiki anchor link objects and in MedHop they are drug/protein database identifiers.
For HotpotQA, two supporting passages are provided along with each question. We ignore the support annotations during training and use them to create the ground truth on the development set: following Wang et al. (2019), we determine the order of the two passages according to whether a passage contains the answer. We discard ambiguous instances.
For MedHop, no evidence is annotated. Therefore we created a new evaluation dataset by manually annotating the correct paths for part of the development set: we first extract all candidate paths in the form of passage triplets $(p_h, p_m, p_t)$, such that $p_h$ contains the query drug, $p_t$ contains the answer drug, and adjacent passages are connected by shared proteins. We label a chain as positive if all the drug-protein or protein-protein interactions along it are described in the corresponding passages. Note that the positive paths are not unique for a question.
During training we select chains based on the full passage set $\mathcal{P}$; at inference time we extract the chains from the candidate set $\mathcal{C}$ (see Section 2).
Baselines and Evaluation Metric
We compare our model with (1) a random baseline, which randomly selects a candidate chain from the distant supervision chain set $\mathcal{C}$; and (2) a distant-supervised MatchLSTM, which uses the same base model as ours but scores and selects the passages independently. We use accuracy as our evaluation metric. As HotpotQA does not provide ground-truth linking entities, we only evaluate whether the supporting passages are fully recovered (though our model still outputs full chains). For MedHop we evaluate whether the whole predicted chain is correct. More details can be found in Appendix B. We use GloVe word embeddings Pennington et al. (2014) for HotpotQA and BioWordVec embeddings Zhang et al. (2019) for MedHop.
4.2 Results
HotpotQA
We first evaluate on the 2-hop HotpotQA task. Our best-performing model first selects the tail passage $p_t$ and then the head passage $p_h$, because the number of tail candidates is smaller (2 per question). Table 1 shows the results. First, training a Ranker with distant supervision performs significantly better than the random baseline, showing that the training process itself has a certain degree of denoising ability to distinguish the more informative signals among the distant supervision labels. By introducing the additional inductive bias of passage order, the conditional selection model improves further by a large margin. Finally, our cooperative game gives the best performance, showing that a trained Reasoner can ignore entity links that are irrelevant to the reasoning chain.
Table 2 demonstrates the effect of the selection direction, together with the methods' recall on head passages and tail passages. The latter is evaluated on a subset of bridge-type questions in HotpotQA that has no ambiguity in the passage order of the support annotations; i.e., among the two human-labeled supporting passages, only one contains the answer and thus must be the tail. The results show that selecting the tail first performs better. The cooperative game mainly improves the head selection.
Table 1: Reasoning chain recovery accuracy on HotpotQA and MedHop.

Model | HotpotQA | MedHop
---|---|---
Random | 40.3% | 56.0%
Distant Supervised MatchLSTM | 74.0% | 59.3%
Conditional Selection | 84.7% | 59.3%
Cooperative Game | 87.2% | 62.6%
Table 2: Effect of the selection direction on HotpotQA: head/tail recall and exact-match (EM) chain accuracy.

Model (HotpotQA) | Head/Tail Recall | Chain EM
---|---|---
Conditional Selection (Head to Tail) | 80.7% / 95.0% | 77.1%
Conditional Selection (Tail to Head) | 88.1% / 96.2% | 84.7%
+ Cooperative Reasoner | 90.1% / 96.7% | 87.2%
MedHop
Results in Table 1 show that recovering chains from MedHop is a much harder task: first, the large number of distant supervision chains in $\mathcal{C}$ introduces too much noise, so the Distant Supervised Ranker improves by only 3%; second, the conditional selection model yields no further improvement because the chains are strictly ordered given our data construction (the head contains the query drug and the tail contains the answer drug). Our cooperative game remains effective and gives a further improvement.
5 Conclusions
In this paper we propose the problem of recovering reasoning chains in multi-hop QA from weak supervision signals. Our model adopts a cooperative game approach in which a Ranker and a Reasoner cooperate to select the most confident chains. Experiments on the HotpotQA and MedHop benchmarks show the effectiveness of the proposed approach.
References
- Asai et al. (2019) Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019. Learning to retrieve reasoning paths over Wikipedia graph for question answering. arXiv preprint arXiv:1911.10470.
- Chen et al. (2019) Jifan Chen, Shih-ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610.
- Das et al. (2018) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In Proceedings of ICLR 2018.
- De Cao et al. (2019) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of NAACL-HLT 2019.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.
- Ding et al. (2019) Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of ACL 2019.
- Godbole et al. (2019) Ameya Godbole, Dilip Kavarthapu, Rajarshi Das, Zhiyu Gong, Abhishek Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. arXiv preprint arXiv:1909.07598.
- Kundu et al. (2019) Souvik Kundu, Tushar Khot, and Ashish Sabharwal. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of ACL 2019.
- Lei et al. (2016) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117.
- Min et al. (2019) Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM approach for weakly supervised question answering. In Proceedings of EMNLP 2019.
- Nie et al. (2019) Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of EMNLP 2019.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Perez et al. (2019) Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. 2019. Finding generalizable evidence by learning to convince q&a models. In Proceedings of EMNLP 2019.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, Technical report, OpenAI.
- Wang et al. (2019) Haoyu Wang, Mo Yu, Xiaoxiao Guo, Rajarshi Das, Wenhan Xiong, and Tian Gao. 2019. Do multi-hop readers dream of reasoning chains? arXiv preprint arXiv:1910.14520.
- Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL-HLT 2016, pages 1442–1451.
- Wang et al. (2018) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI 2018.
- Welbl et al. (2018) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.
- Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
- Xiao et al. (2019) Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of ACL 2019.
- Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. DeepPath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of EMNLP 2017.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP 2018.
- Yao et al. (2019) Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL 2019.
- Yu et al. (2019) Mo Yu, Shiyu Chang, Yang Zhang, and Tommi S Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. arXiv preprint arXiv:1910.13294.
- Zhang et al. (2019) Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. 2019. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6(1):52.
Appendix A Details of MatchLSTMs for Passage Scoring and Reasoner
MatchLSTM for Passage Scoring
Given the embeddings $\mathbf{E}^{q}$ of the question $q$ and $\mathbf{E}^{p_i}$ of each passage $p_i$, we use the MatchLSTM Wang and Jiang (2016) to match $q$ and $p_i$ as follows:

$$\begin{aligned}
\mathbf{H}^{q} = \mathrm{BiGRU}(\mathbf{E}^{q}), \qquad & \mathbf{H}^{p_i} = \mathrm{BiGRU}(\mathbf{E}^{p_i}) \\
\alpha_{jk} = \mathrm{softmax}_k\big((\mathbf{h}^{p_i}_j)^{\top}\mathbf{h}^{q}_k\big), \qquad & \tilde{\mathbf{h}}^{q}_j = \textstyle\sum_k \alpha_{jk}\,\mathbf{h}^{q}_k \\
\mathbf{M}^{p_i} = \mathrm{BiGRU}\big([\mathbf{H}^{p_i}; & \ \tilde{\mathbf{H}}^{q}]\big)
\end{aligned} \tag{6}$$

The final vector $\mathbf{m}^{p_i}$, the last hidden state of $\mathbf{M}^{p_i}$, represents the matching state between $q$ and $p_i$. All the $\mathbf{m}^{p_i}$'s are then passed to a linear layer that outputs the ranking score of each passage. We apply a softmax over the scores to get the probability of passage selection $P(p_i \mid q)$. We denote the above computation as $P(p_i \mid q) = \mathrm{MatchLSTM}(\mathbf{H}^{p_i}, \mathbf{H}^{q})$ for simplicity.
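For concreteness, a simplified PyTorch sketch of such a matching scorer is given below. This is our illustration under stated assumptions (dot-product attention, final-state pooling, and the class name `MatchScorer` are ours), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchScorer(nn.Module):
    """MatchLSTM-style passage scorer (simplified sketch).

    Scores one passage against the question: attend the question over the
    passage tokens, concatenate, re-encode with a BiGRU, and map the final
    matching state to a scalar score.
    """
    def __init__(self, hidden):
        super().__init__()
        self.match_rnn = nn.GRU(2 * hidden, hidden, bidirectional=True,
                                batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, h_q, h_p):
        # h_q: (Lq, hidden) encoded question; h_p: (Lp, hidden) encoded passage
        attn = F.softmax(h_p @ h_q.t(), dim=-1)        # (Lp, Lq) attention weights
        q_tilde = attn @ h_q                           # question summary per passage token
        m, _ = self.match_rnn(torch.cat([h_p, q_tilde], dim=-1).unsqueeze(0))
        return self.score(m[0, -1])                    # final matching state -> score
```

Applying such a scorer to every candidate passage and taking a softmax over the resulting scores yields the selection distribution $P(p_i \mid q)$.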
MatchLSTM for Reasoner
Given the question embedding $\mathbf{E}^{q}$ and the embedding $\mathbf{E}^{p}$ of the selected input passage $p$, the Reasoner predicts the probability of each entity in the passage being the linking entity to the next passage in the chain. We use a reader model similar to Yang et al. (2018) as our Reasoner network.
We first describe an attention sub-module. Given input sequence embeddings $\mathbf{A}$ and $\mathbf{B}$, we define $\mathrm{Attn}(\mathbf{A}, \mathbf{B})$:

$$\begin{aligned}
\alpha_{jk} &= \mathrm{softmax}_k\big(\mathbf{a}_j^{\top}\mathbf{b}_k\big) \\
\bar{\mathbf{a}}_j &= \textstyle\sum_k \alpha_{jk}\,\mathbf{b}_k \\
\mathrm{Attn}(\mathbf{A}, \mathbf{B}) &= \mathrm{FFN}\big([\mathbf{A}; \bar{\mathbf{A}}]\big)
\end{aligned} \tag{7}$$

where FFN denotes a feed-forward layer which projects the concatenated embedding back to the original space.
The Reasoner network consists of multiple attention layers, together with a bi-directional GRU encoder and a skip connection:

$$\begin{aligned}
\mathbf{H}^{p} &= \mathrm{BiGRU}\big(\mathrm{Attn}(\mathbf{E}^{p}, \mathbf{E}^{q})\big) \\
\mathbf{G}^{p} &= \mathrm{Attn}(\mathbf{H}^{p}, \mathbf{H}^{p}) + \mathbf{H}^{p}
\end{aligned} \tag{8}$$
For each token $t_k$ of the passage, represented by $\mathbf{g}^{p}_k$ at the corresponding location, we have:

$$P(e_k \mid q, p) = \mathrm{softmax}_k\big(\mathbf{w}_c^{\top}\,\mathbf{g}^{p}_k\big) \tag{9}$$

where $\mathbf{w}_c$ is the classification layer and the softmax is applied across all entity tokens to get the probability. We denote the computation above as $P(e \mid q, p) = \mathrm{Reasoner}(\mathbf{H}^{q}, \mathbf{H}^{p})$ for simplicity.
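A minimal sketch of this final scoring step is shown below, assuming token-level representations and known entity-mention positions as inputs; the class name and interface are our own illustration.

```python
import torch
import torch.nn as nn

class EntityScorer(nn.Module):
    """Token-level linking-entity scorer (illustrative sketch only).

    Given contextual token representations of a passage and the token
    positions of its entity mentions, score each mention with a linear
    layer and normalize across mentions to get P(entity | q, p).
    """
    def __init__(self, hidden):
        super().__init__()
        self.cls = nn.Linear(hidden, 1)

    def forward(self, token_reprs, entity_positions):
        # token_reprs: (Lp, hidden); entity_positions: list of token indices
        mention_reprs = token_reprs[torch.tensor(entity_positions)]
        logits = self.cls(mention_reprs).squeeze(-1)   # one score per mention
        return torch.softmax(logits, dim=-1)           # distribution over entities
```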
Appendix B Definition of Chain Accuracy
In HotpotQA, on average we can find 6 candidate chains (2-hop) per instance, and the human-labeled true reasoning chain is unique. A predicted chain is correct if it contains exactly the supporting passages (exact match of passages).
In MedHop, on average we can find 30 candidate chains (3-hop). For each candidate chain, our human annotators labeled whether it is correct or not, and the correct reasoning chain is not unique. A predicted chain is correct if it is one of the chains labeled as correct by the annotators.
The accuracy is defined as the ratio:

$$\mathrm{accuracy} = \frac{\#\ \text{of correctly predicted chains}}{\#\ \text{of evaluation instances}} \tag{10}$$
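In code, this amounts to a simple helper (our illustration):

```python
def chain_accuracy(predicted_chains, gold_chain_sets):
    """Fraction of instances whose predicted chain matches a gold chain.

    predicted_chains: list, one predicted chain per instance
    gold_chain_sets:  list of sets of correct chains (a single unique chain
                      for HotpotQA, possibly several for MedHop)
    """
    correct = sum(pred in gold for pred, gold in zip(predicted_chains, gold_chain_sets))
    return correct / len(predicted_chains)
```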