
Could you give me a hint?
Generating inference graphs for defeasible reasoning

Aman Madaan*, Dheeraj Rajagopal*, Niket Tandon*, Yiming Yang, Eduard Hovy
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{dheeraj,amadaan,yiming,hovy}@cs.cmu.edu
{nikett}@allenai.org
  *Equal Contribution
Abstract

Defeasible reasoning is a mode of reasoning in which conclusions can be overturned by taking into account new evidence. A commonly used method in the cognitive science and logic literature is to handcraft argumentation-supporting inference graphs. While humans find inference graphs very useful for reasoning, constructing them at scale is difficult. In this paper, we automatically generate such inference graphs through transfer learning from a related NLP task that shares the kind of reasoning that inference graphs support. Through automated metrics and human evaluation, we find that our method generates meaningful graphs for the defeasible inference task. Human accuracy on this task improves by 20% when judges consult the generated graphs. Our findings open up exciting new research avenues for cases where machine reasoning can help human reasoning. A dataset of 230,000 influence graphs (one for each defeasible query) is available at: https://tinyurl.com/defeasiblegraphs.

1 Introduction

Defeasible inference (Rudinger et al., 2020) is a mode of reasoning in which, given a premise P (Rob went for a hike), a hypothesis H (Rob saw an elephant; it was pink) may be weakened or overturned in light of new evidence, i.e., an update U (Rob often has hallucinations). Given the non-monotonic nature of this reasoning, humans find this task challenging to master (Morgan, 2004). The problem has been widely studied in classical AI through logic (Israel, 1980; McCarthy, 1981) and in cognitive science through argumentative models (Pollock, 1987). A prominent approach is to support defeasible inference through argumentation by constructing an inference graph (Pollock, 2009).

Despite their prominence (Bentahar et al., 2010), argumentative models are not scalable because an inference graph must be handcrafted for every example. Recently, Rudinger et al. (2020) proposed two auxiliary tasks related to defeasible inference: (i) an NLI task to predict whether an update U would weaken or strengthen a hypothesis H, and (ii) a generative task to produce an update U given a premise P and a hypothesis H. However, this addresses only part of the problem: their inference is still not supported by the line of reasoning that a human typically uses to solve this task, namely mediators (e.g., hallucinations can be deceptive) and contextualizers (e.g., some elephants have a mutated gene that makes them look different) that are inherently embedded in an inference graph, limiting their utility for humans (Figure 1).

In this paper, we adopt the concept of an inference graph for defeasible reasoning from cognitive science and provide a computational model that makes their generation scalable. Training such a model would ordinarily require a large number of annotated inference graphs, which would be too expensive to obtain. Instead, we draw a parallel to a related reasoning task in NLP (Tandon et al., 2019), whose supporting graphs we find share the kind of reasoning that an inference graph supports. We train a model on the NLP task and transfer it to generate inference graphs. Such transfer learning is made possible by powerful sequence-to-sequence neural language models that were not previously available.

Refer to caption
Figure 1: (a) An example of an inference graph adapted from Pollock (2009) and (b) the structure of an influence graph adapted from the wiqa dataset (Tandon et al., 2019). The adapted influence graph incorporates contextualizers, mediators, hypotheses, and situations, making it useful for defeasible reasoning.

The contributions of this paper are the answers to the following two research questions:

  • rq1: Can we automate the construction of argumentation-supporting inference graphs? In §2, we show that we can effectively construct meaningful graphs using transfer learning.

  • rq2: Can our generated graphs help improve human performance? In §3, we show that humans leverage the generated graphs to improve their performance on a previously reported benchmark.

2 rq1: Generating Argumentation-Supporting Inference Graphs

We start by drawing parallels to a counterfactual reasoning task in NLP: the wiqa task (Tandon et al., 2019). wiqa consists of a set of procedural passages, each accompanied by a human-curated influence graph that captures the causal influences between events in the context of the process described by the passage. We connect inference graphs (Pollock, 2009) with influence graphs (Tandon et al., 2019) by drawing parallels between their reasoning structures. In essence, each inference graph from Pollock (1987) can be instantiated via an influence graph from Tandon et al. (2019) by interpreting the nodes in both graphs as follows (Figure 1):

  i. Contextualizers (C): these nodes set the context around a situation and connect to the premise P in some way.

  ii. Updates (U): these nodes are new situations that emerge and might overturn an inference.

  iii. Hypotheses (H): these nodes describe the outcome or conclusion of the situation.

  iv. Mediators (M): these nodes bridge the knowledge gap between a situation and a hypothesis node by making their connection explicit.

Figure 1 highlights the similarities between the two graphs by labeling an example node adapted from Pollock (2009), and the structure of the influence graph from Tandon et al. (2019), with the four node types defined above. A green edge indicates that the source node positively influences the target node, and a red edge indicates a negative influence. Further, each node can act as either a strengthener (+) or a weakener (-) of the hypothesis. Consequently, these graphs support similar types of reasoning: for example, the effect of U on H, and how this effect can change in light of external influences (C), is captured by the graph paths from C+ to U and from U via a mediator node (M+/M-) to H. Inspired by these similarities, we hypothesize that influence graphs can be used to supplement defeasible reasoning.
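To make this structure concrete, below is a minimal sketch (ours, not part of the wiqa release) of how an influence graph with the four node types and signed edges can be represented and traversed; all names are illustrative.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class InfluenceGraph:
        # Maps a node label (e.g., "M+") to its free-text phrase.
        nodes: Dict[str, str] = field(default_factory=dict)
        # Signed edges: (source label, target label, "helps" or "hurts").
        edges: List[Tuple[str, str, str]] = field(default_factory=list)

        def paths_to_hypothesis(self, start: str) -> List[List[str]]:
            """Enumerate simple paths from `start` to a hypothesis node (H+/H-)."""
            paths, stack = [], [[start]]
            while stack:
                path = stack.pop()
                if path[-1].startswith("H"):
                    paths.append(path)
                    continue
                for src, tgt, _ in self.edges:
                    if src == path[-1] and tgt not in path:
                        stack.append(path + [tgt])
            return paths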

2.1 Influence Graph Generation

To obtain an influence graph for each defeasible query, we perform a zero-shot transfer from wiqa (Tandon et al., 2019), a corpus of 2,100 (passage, influence graph) pairs (dataset details in Appendix §E).

Training:

We treat influence graph generation as a sequence-to-sequence mapping task and leverage wiqa to derive parallel data $\{(\mathbf{seq}_{ip}^{i}, \mathbf{seq}_{op}^{i})\}_{i=1}^{N}$ for it. Let $(\mathbf{T}_i, \mathbf{G}_i)$ be a sample in wiqa, where $\mathbf{T}_i$ is the passage text (e.g., describing how viruses spread) and $\mathbf{G}_i$ is the corresponding influence graph (e.g., Figure 2). The model trains best when the input sequence $\mathbf{seq}_{ip}^{i}$ is built with explicit markers (an example is shown in Appendix §A):

$\mathbf{seq}_{ip}^{i} = \text{Premise: } \mathbf{T}_i \mid \text{Update: } \mathbf{U}_i \mid \text{less/more: } \mathbf{H}_i$    (1)

where $\mathbf{T}_i$ is the passage text (e.g., steps describing how viruses spread), and $\mathbf{U}_i$ and $\mathbf{H}_i$ are nodes of $\mathbf{G}_i$ (these are phrases, as shown in Figure 2).
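As an illustrative sketch (ours), the helper below fills template (1) for one sample; the marker strings follow the training example in Appendix §A, which uses a Situation field and Less/More-prefixed copies of the hypothesis.

    def make_input_sequence(passage: str, update: str, hypothesis: str) -> str:
        """Fill template (1): the passage, the update (serialized as "Situation",
        per Appendix A), and the hypothesis under its attenuated (Less) and
        strengthened (More) outcomes."""
        return (f"Premise: {passage} Situation: {update} "
                f"Less: LESS {hypothesis} More: MORE {hypothesis}")

    # e.g., make_input_sequence("Sunlight shines on plants. ...",
    #                           "more minerals are absorbed",
    #                           "sugar and oxygen being produced")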

Refer to caption
Figure 2: An example of an influence graph similar to the ones in wiqa that we train on.

The output $\mathbf{seq}_{op}^{i}$ is set to a dot-string representation of the corresponding influence graph $\mathbf{G}_i$, since such a representation was shown to be effective for extracting high-quality graphs from free-form text using language models (Madaan and Yang, 2021) (examples in the Appendix). Thus, each passage-graph pair $(\mathbf{T}_i, \mathbf{G}_i)$ from wiqa is mapped to an input-output pair in $\mathcal{D} = \{(\mathbf{seq}_{ip}^{i}, \mathbf{seq}_{op}^{i})\}$. We use this corpus to fine-tune an autoregressive language model $\mathcal{L}$ for graph generation. The fine-tuned $\mathcal{L}$ lets us efficiently sample an influence graph for a given input sequence $\mathbf{seq}_{ip}^{j}$ by drawing $\mathbf{G}_j \sim P_{\theta}(y \mid \mathbf{seq}_{ip}^{j})$ with greedy decoding, where $\theta$ denotes the parameters of the language model.
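Our experiments fine-tune T5 with the original TPU codebase (Appendix §C). Purely as an illustrative sketch, greedy decoding of a graph from a fine-tuned checkpoint with the Hugging Face transformers API might look as follows; the checkpoint name and length limits here are placeholders, not our actual setup.

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Placeholder checkpoint; we fine-tune T5-11B on the wiqa-derived corpus D.
    tokenizer = T5Tokenizer.from_pretrained("t5-large")
    model = T5ForConditionalGeneration.from_pretrained("t5-large")

    def generate_graph(seq_ip: str) -> str:
        """Greedily decode a dot-string influence graph for one input sequence."""
        inputs = tokenizer(seq_ip, return_tensors="pt", truncation=True, max_length=512)
        outputs = model.generate(**inputs, max_length=512, do_sample=False, num_beams=1)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)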

Zero-shot Transfer to Defeasible Inference:

We use the model $\mathcal{L}$ trained on wiqa to generate inference graphs for the defeasible inference dataset of Rudinger et al. (2020). We obtain an influence graph for each defeasible input (P, H, U) by converting it into an input sequence for $\mathcal{L}$ via template (1): the premise P serves as the context passage T, the update U serves as the node U, and the attenuated and strengthened outcomes are simulated by prefixing the hypothesis H with the tokens Less and More, respectively. This input is then passed to $\mathcal{L}$ to generate an influence graph.
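Reusing the template helper sketched in §2.1, this conversion amounts to the following (names assumed, as before):

    def defeasible_to_input(premise: str, hypothesis: str, update: str) -> str:
        """Map a defeasible (P, H, U) query onto template (1): P takes the role of
        the passage T, U remains the update, and the Less/More prefixes on H
        simulate the attenuated and strengthened outcomes."""
        return make_input_sequence(passage=premise, update=update, hypothesis=hypothesis)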

Results on Influence Graph Generation

We use T5-11B (Raffel et al., 2020), fine-tuned on the corpus $\mathcal{D}$ derived from wiqa (§2.1), as our graph-generation language model $\mathcal{L}$. All graphs generated by our model were in valid dot format. We use the standard generation metrics BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to evaluate $\mathcal{L}$ on the test split of wiqa. Each node $N_i$ in the reference graph is compared with the corresponding generated node $\hat{N_i}$ using $\texttt{BLEU}(N_i, \hat{N_i})$ (Node-BLEU). Further, node-edge-node pairs (neighbors) $(N_i, N_j)$ and $(\hat{N_i}, \hat{N_j})$ are compared using Rel-BLEU $= \texttt{HM}(\texttt{BLEU}(N_i, \hat{N_i}), \texttt{BLEU}(N_j, \hat{N_j}))$, where HM is the harmonic mean. These metrics are averaged over each graph (i.e., across its nodes and edges) and then across the corpus. We run these experiments with two different language models: gpt-2-medium (Radford et al., 2019) and t5-11b. Finally, we calculate the overlap between the edge structures of the reference and generated graphs as Edge-match%. We report the numbers in Table 1 and include a random baseline for reference. A random baseline will correctly generate the nodes S, H+, and H-, as they are part of the query (3 of the 8 nodes, yielding a Node-BLEU of 37.5). Since none of these three nodes is connected to another, the random baseline is unlikely to generate any node pair correctly (Rel-BLEU ≈ 0). Since only two unique graph structures are possible (Tandon et al., 2019), a random baseline achieves Edge-match ≈ 50%. Table 1 shows that our t5-based model generates syntactically valid (high Edge-match) and semantically meaningful graphs. Additionally, we find that the generated graphs help humans on a downstream task, as described next.

Model Random gpt-2-medium t5-11b
Node-BLEU 37.5 46.05 50.94
Rel-BLEU 0.0 19.34 33.01
Edge-match% 50.0 92.86 97.63
Table 1: Results on automated metrics, showing that our t5-11b model generates accurate graph structures and meaningful nodes that closely match the reference nodes.
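As a sketch of the node-level metrics, using nltk's sentence-level BLEU as a stand-in (we do not prescribe a particular BLEU implementation here):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    _smooth = SmoothingFunction().method1

    def node_bleu(ref_node: str, gen_node: str) -> float:
        """BLEU between a reference node and the corresponding generated node."""
        return sentence_bleu([ref_node.split()], gen_node.split(),
                             smoothing_function=_smooth)

    def rel_bleu(ref_pair, gen_pair) -> float:
        """Rel-BLEU: harmonic mean of node BLEUs over a connected pair (N_i, N_j)."""
        a = node_bleu(ref_pair[0], gen_pair[0])
        b = node_bleu(ref_pair[1], gen_pair[1])
        return 2 * a * b / (a + b) if (a + b) > 0 else 0.0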

3 rq2: Do generated graphs help humans at defeasible reasoning?

Human Evaluation

Rudinger et al. (2020) performed a human evaluation on 2,000 defeasible queries: given (P, H, U), the task was to label the effect of U on H as Intensifies or Attenuates. Three human judges labeled each query, and the majority label was compared with the ground truth to compute accuracy. In their setup, the judges were collectively right on 1,745 samples (correct pool) and wrong on 255 samples (wrong pool). We create a challenging pool of 510 queries for the human judges by combining the 255 queries in the wrong pool with 255 queries sampled from the correct pool, giving a baseline accuracy of 50% on this eval pool. Each query in this pool is supplemented with a generated influence graph (§2; see Appendix §B for a discussion of IRB exemption). We found that our generated influence graphs showed high levels of redundancy in contextualizers and mediators, with about 46% of the graphs repeating these nodes. Because humans find positive chains of influence simpler to follow, to reduce their cognitive load we post-process each influence graph to retain only the strengthening contextualizer (Figure 1), the situation (U), the strengthening mediator (M+), and the hypothesis (H).
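A sketch of this post-processing step, reusing the InfluenceGraph structure sketched in §2 (the label set is an assumption; our generated graphs label the situation node S, as in Appendix §A):

    def positive_chain(graph: InfluenceGraph) -> InfluenceGraph:
        """Retain only the strengthening chain C+ -> S -> M+ -> H+ shown to
        the judges, dropping the redundant weakening counterparts."""
        keep = {"C+", "S", "U", "M+", "H+"}  # S and U both name the situation node
        nodes = {lbl: txt for lbl, txt in graph.nodes.items() if lbl in keep}
        edges = [(s, t, rel) for s, t, rel in graph.edges
                 if s in keep and t in keep]
        return InfluenceGraph(nodes=nodes, edges=edges)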

To establish comparable gains, we replicate the evaluation setup of Rudinger et al. (2020): we use the same Amazon Mechanical Turk template and instruction set, and the same pool of 230 qualified annotators that Rudinger et al. (2020) selected through a paid qualification test in which workers answered SNLI queries of varying difficulty. We paid slightly above $15 per hour for the tasks.

For each query, in addition to answering the defeasible question, three judges were asked to evaluate the augmented influence graph on two aspects:

  i) Is the influence graph useful? The judges selected one of the following:

    (a) helpful: the graph was crucial in helping answer the question.

    (b) relevant but not helpful: the graph was on the right topic (relevant to the question) but did not help in answering it.

    (c) irrelevant or misleading: the graph was irrelevant to the question, or misled the judge toward a wrong answer.

  ii) Why is the influence graph useful? The judges were given the option to highlight the most useful aspect of the generated influence graph. They could tag one or more of the following aspects as the most helpful: i) extraneous node, ii) mediating node, and iii) structure of the graph.

We summarize the key findings below.

Finding 1: Influence graphs are helpful and relevant

As Table 2 shows, a large majority of the human judges found the influence graphs to be helpful or relevant. We calculate inter-annotator agreement for this question as majority-agreement $= \frac{1}{N}\sum_{i=1}^{N} \texttt{ma}_i$, where $\texttt{ma}_i$ indicates majority agreement on the $i^{th}$ sample (i.e., at least 2 of the 3 judges agreed on its label). The majority-agreement (ma) on these labels was 0.83. The judges marked about 25% of the graphs as relevant but not helpful; the graphs in such cases were on topic but did not help in answering the query, distinguishing them from the cases where the graph was crucial to reaching the correct answer. Finally, we note that the graphs provided as hints could have helped in two ways: by leading the human annotators to the answer, or by reinforcing the mental picture that let them make the right decision. Future research in this direction is needed to study these aspects in depth.

Helpful 47.25
Relevant but not helpful 25.09
Irrelevant or misleading 10.58
No majority agreement 17.05
Table 2: Helpfulness of the augmentations (% of queries).
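The majority-agreement statistic is simple to compute; a sketch over per-sample judge labels (function name ours):

    from collections import Counter

    def majority_agreement(labels_per_sample) -> float:
        """Fraction of samples where at least 2 of the 3 judges chose the same label."""
        agreed = sum(1 for labels in labels_per_sample
                     if Counter(labels).most_common(1)[0][1] >= 2)
        return agreed / len(labels_per_sample)

    # e.g., majority_agreement([["helpful", "helpful", "irrelevant"], ...])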

Finding 2: Mediators are the most helpful for defeasible queries

For every sample, we asked the human judges to mark which parts of the graph were the most helpful (as shown in Figure 6 in Appendix §D.1). The judges could select more than one aspect of the graph if they found several useful. Table 3 shows the percentage of human judges who selected each graph aspect as most helpful. We observe that 49.48% of the judges who found the graphs useful indicated the mediator node as the most helpful. This suggests that while other events may impact U and H, the mediating events are the most informative in determining the type of link between them.

Aspect % marked useful
Mediator 49.48
Extraneous 32.03
Structure 12.81
None helpful 5.68
Table 3: Most useful aspects of an influence graph.

Finding 3: Machine-generated influence graphs help humans in defeasible reasoning

Table 4 shows that performance improves across all three tasks when the defeasible query is augmented with an influence graph. On our challenging set of 510 queries, overall accuracy jumps nearly 20 points, from 0.500 to 0.698. Figure 3 shows that 113 queries that were previously answered incorrectly were answered correctly when augmented with influence graphs.

Dataset   Human (Rudinger et al., 2020)   Human (ours)
snli      0.461 ± 0.11                    0.553 ± 0.11
social    0.628 ± 0.07                    0.814 ± 0.06
atomic    0.418 ± 0.06                    0.657 ± 0.06
overall   0.500 ± 0.04                    0.698 ± 0.04
Table 4: Human performance (accuracy) on the three tasks with and without generated influence graphs, along with Wilson score intervals at 95% confidence. We tested the statistical significance of these results using McNemar's test (McNemar, 1947) and found them highly significant (p < 1e-6).
Refer to caption
Figure 3: Human performance before and after the human judges were provided with the influence graph.
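For reference, the two statistics reported with Table 4 can be computed as follows (a sketch, not our released analysis code). McNemar's test takes the two discordant counts: queries answered correctly only without the graphs, and only with them.

    import math
    from scipy.stats import chi2, norm

    def mcnemar_p(b: int, c: int) -> float:
        """McNemar's test with continuity correction; b and c are the discordant
        counts (correct only without graphs vs. correct only with graphs)."""
        stat = (abs(b - c) - 1) ** 2 / (b + c)  # assumes b + c > 0
        return chi2.sf(stat, 1)

    def wilson_interval(n_correct: int, n: int, alpha: float = 0.05):
        """Wilson score interval for a binomial proportion (accuracy)."""
        z = norm.ppf(1 - alpha / 2)
        p = n_correct / n
        denom = 1 + z ** 2 / n
        center = (p + z ** 2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
        return center - half, center + half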

4 Discussion and Conclusion

Our work takes the idea of using inference graphs for defeasible inference and scales up its usability by automatically generating such graphs and using them to augment a downstream defeasible task that both humans and machines are known to find difficult. We identify the contextualizer and mediator nodes as crucial to defeasible inference, and show that our model generates these critical nodes effectively. Humans perform significantly better (a 20% absolute improvement) across diverse defeasible datasets and overwhelmingly attribute their success to the mediator nodes, giving insight into what helps and why. In this case study, we show that machines can fill gaps in human knowledge for defeasible reasoning. While we establish that humans are helped by these graphs, further investigation into how (and whether) the graphs reinforced their beliefs, and what additional information in the graphs benefited their understanding, is essential. A deeper understanding of the trade-offs (e.g., time spent answering these questions with and without the graphs) also forms important future work.

Acknowledgments

We would like to thank Peter Clark for the thoughtful discussions, and the anonymous reviewers for valuable feedback.

This material is based in part on research sponsored by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. We would like to thank Google for providing the TPU machines for conducting experiments.

References

  • Bentahar et al. (2010) J. Bentahar, B. Moulin, and M. Bélanger. 2010. A taxonomy of argumentation models used for knowledge representation. Artificial Intelligence Review, 33:211–259.
  • Israel (1980) D. Israel. 1980. What’s wrong with non-monotonic logic? In AAAI.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Madaan and Yang (2021) Aman Madaan and Yiming Yang. 2021. Neural language modeling for contextualized temporal graph generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 864–881, Online. Association for Computational Linguistics.
  • McCarthy (1981) J. McCarthy. 1981. Some philosophical problems from the standpoint of artificial intelligence. Machine intelligence.
  • McNemar (1947) Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.
  • Morgan (2004) C. Morgan. 2004. The nature of nonmonotonic reasoning. Minds and Machines, 10:321–360.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Pollock (1987) J. Pollock. 1987. Defeasible reasoning. Cogn. Sci., 11:481–518.
  • Pollock (2009) J. Pollock. 2009. A recursive semantics for defeasible reasoning. In Argumentation in Artificial Intelligence.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
  • Rudinger et al. (2020) Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. 2020. Thinking like a skeptic: Defeasible inference in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4661–4675, Online. Association for Computational Linguistics.
  • Tandon et al. (2019) Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. Wiqa: A dataset for “what if…” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6078–6087.

Appendix A Sample input-output sequence for training \mathcal{L}

We now present a sample input-output sequence used to train our $\mathcal{L}$ for graph generation. The input-output sample $(\mathbf{seq_{ip}}, \mathbf{seq_{op}})$ is presented below.

  1. As described in Section 2.1, each input sequence $\mathbf{seq_{ip}}$ is formatted in a special template (template (1)) before being fed to the language model. An example from our training data: Premise: Sunlight shines on plants. Cells with chlorophyll in them … other parts of the plant. Situation: more minerals are absorbed Less: LESS sugar and oxygen being produced More: MORE sugar and oxygen being produced

  2. Each output graph is encoded as a dot string. The output dot sequence $\mathbf{seq_{op}}$ corresponding to the input above is (line breaks added for readability):

    strict digraph {
      "C+ : less minerals in the soil [OR] less root system" -> "S : more minerals are absorbed" [label=hurts];
      "C- : more minerals in the soil [OR] a better root system" -> "S : more minerals are absorbed" [label=helps];
      "S : more minerals are absorbed" -> "M- : less conversion into sugars [OR] less oxygen produced" [label=hurts];
      "S : more minerals are absorbed" -> "M+ : more conversion into sugars" [label=helps];
      "S- : less minerals absorbed [OR] less root system" -> "M+ : more conversion into sugars" [label=hurts];
      "M- : less conversion into sugars [OR] less oxygen produced" -> "H- : LESS sugar and oxygen being produced" [label=helps];
      "M- : less conversion into sugars [OR] less oxygen produced" -> "H+ : MORE sugar and oxygen being produced" [label=hurts];
      "M+ : more conversion into sugars" -> "H+ : MORE sugar and oxygen being produced" [label=helps];
      "M+ : more conversion into sugars" -> "H- : LESS sugar and oxygen being produced" [label=hurts];
    }

Refer to caption
Figure 4: The influence graph corresponding to the dot code shown in $\mathbf{seq_{op}}$.
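To recover a graph from a generated dot string like $\mathbf{seq_{op}}$ above, a lightweight regular-expression parse suffices (a sketch of ours; a full DOT parser such as pydot would also work):

    import re

    # Matches edges of the form: "src" -> "tgt" [label=helps];
    EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"\s*\[label=(\w+)\]')

    def parse_dot(dot_string: str):
        """Extract (source, target, helps/hurts) triples from a dot string."""
        return [m.groups() for m in EDGE_RE.finditer(dot_string)]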

Appendix B IRB Exemption

Our study was not an experiment on humans (it posed no identifiable risk to the human judges), did not collect any identifying information, and involved only adults. Per the IRB guidelines, it does not fall under the purview of human-subjects research: we do not publish individual workers' answers, but only tallied data, much like a "benign behavioral intervention." This exempts us from IRB review (category 3 of the Federal Regulations for Protection of Human Research Subjects, https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/).

Appendix C Infrastructure and hyperparameters

To train the T5-11B model, comprising 11 billion parameters, we used v3-8 TPUs. The average training time was 7 hours for about 10 epochs. We used the same hyperparameters as provided with the T5 checkpoint at gs://t5-data/pretrained_models/11B, with a maximum block size of 512 tokens and a maximum generation length of 512. For decoding, we sample according to the predicted distribution. We trained the GPT-2 model on an Nvidia RTX 2080 Ti; training takes about 30 minutes per epoch.

We use the medium (355M-parameter) variant of gpt-2 (Radford et al., 2019), with 24 layers, a hidden size of 1024, and 16 attention heads.

Appendix D Details of our Mechanical Turk Setup

We follow the same instructions for human judges as Rudinger et al. (2020) (we are grateful to the authors for sharing their Mechanical Turk setup template with us), and only additionally provided instructions for the inference graph. We used a pool of 230 annotators who were previously qualified and selected for the defeasible inference task, providing a fair comparison to their setup; ultimately, 12 of these 230 workers worked on our HITs. The graph we showed to humans was a subgraph of the inference graph, where the selected path carries the relevant content, to avoid showing redundant opposite edges. These redundant edges are useful when training a model, which must jointly predict all the nodes, but they are redundant for humans. Figure 5 shows this subgraph.

Refer to caption
Figure 5: Part of the generated influence graph that is presented in the HIT.

D.1 A sample HIT

We now show a sample HIT in Figure 6. Each HIT contained two sets of annotations.

Refer to caption
Figure 6: A sample HIT in mechanical turk.

D.2 Examples that helped humans

Next, we show two examples (Figures 7 and 8) where humans were unsuccessful in the original setup of Rudinger et al. (2020) but succeeded after consulting the inference graphs. The judges marked the mediator and contextualizer nodes as providing useful information.

Refer to caption
Figure 7: An example where the graph helped the judge reach the correct answer on a query that humans previously answered incorrectly.
Refer to caption
Figure 8: Another example where the graph helped the judge reach the correct answer on a query that humans previously answered incorrectly.

Appendix E Dataset

Dataset   Split   # Samples   Total
wiqa      train   1,522       2,107
          test    189
          dev     152
atomic    train   35,001      42,977
          test    4,137
          dev     3,839
social    train   88,675      92,295
          test    1,836
          dev     1,784
snli      train   77,015      95,795
          test    9,438
          dev     9,342
Table 5: Number of samples in each dataset by split. atomic, snli, and social are available at https://github.com/rudinger/defeasible-nli; wiqa is available at https://allenai.org/data/wiqa.