
Weakly Supervised Neuro-Symbolic Module Networks for Numerical Reasoning

Amrita Saha, Shafiq Joty, Steven C.H. Hoi
Salesforce AI Research
Abstract

Neural Module Networks (NMNs) have been quite successful in incorporating explicit reasoning as learnable modules in various question answering tasks, including the most generic form of numerical reasoning over text in Machine Reading Comprehension (MRC). However, to achieve this, contemporary NMNs need strong supervision for executing the query as a specialized program over reasoning modules, and fail to generalize to more open-ended settings without such supervision. Hence we propose the Weakly-Supervised Neuro-Symbolic Module Network (WNSMN), trained with answers as the sole supervision for numerical reasoning based MRC. It learns to execute a noisy heuristic program, obtained from the dependency parsing of the query, as discrete actions over both neural and symbolic reasoning modules, and is trained end-to-end in a reinforcement learning framework with discrete rewards from answer matching. On the numerical-answer subset of DROP, WNSMN outperforms NMN by 32% and the reasoning-free language model GenBERT by 8% in exact-match accuracy when trained under comparable weakly supervised settings. This showcases the effectiveness and generalizability of modular networks that can handle explicit discrete reasoning over noisy programs in an end-to-end manner.

1 Introduction

End-to-end neural models have proven to be powerful tools for an expansive set of language and vision problems by effectively emulating input-output behavior. However, many real problems like Question Answering (QA) or Dialog need more interpretable models that can incorporate explicit reasoning in the inference. In this work, we focus on the most generic form of numerical reasoning over text, encompassed by the reasoning-based MRC framework. A particularly challenging setting for this task is where the answers are numerical in nature, as in the popular MRC dataset DROP (Dua et al., 2019). Figure 1 shows the intricacies involved in the task: (i) passage and query language understanding, (ii) contextual understanding of the passage dates and numbers, and (iii) application of quantitative reasoning (e.g., max, not) over dates and numbers to reach the final numerical answer.

Three broad genres of models have proven successful on the DROP numerical reasoning task.
First, large-scale pretrained language models like GenBERT (Geva et al., 2020) use a monolithic Transformer architecture and decode numerical answers digit by digit. Though they deliver mediocre performance when trained only on the target data, their competency derives from pretraining on massive synthetic data augmented with explicit supervision of the gold numerical reasoning.
The second kind are reasoning-free hybrid models like NumNet (Ran et al., 2019), NAQANet (Dua et al., 2019), NABERT+ (Kinley & Lin, 2019), MTMSN (Hu et al., 2019), and NeRd (Chen et al., 2020). They explicitly incorporate numerical computations into the standard extractive QA pipeline by learning a multi-type answer predictor over different reasoning types (e.g., max/min, diff/sum, count, negate) and directly predicting the corresponding numerical expression, instead of learning to reason. This is facilitated by exhaustively precomputing all possible outcomes of the discrete operations and augmenting the training data with reasoning-type supervision and the numerical expressions that lead to the correct answer.
Lastly, the most relevant class of models for this work are the modular networks for reasoning. Neural Module Networks (NMN) (Gupta et al., 2020) is the first explicit-reasoning-based QA model; it parses the query into a specialized program and executes it step-wise over learnable reasoning modules. However, to do so, apart from the exhaustive precomputation of all discrete operations, it also needs more fine-grained supervision of the gold program and the gold program execution, obtained heuristically by leveraging the abundance of templatized queries in DROP.

Figure 1: Example (passage, query, answer) from DROP and outline of our method: executing a noisy program obtained from dependency parsing of the query, by learning date/number entity-specific cross attention, and by sampling and executing discrete operations on entity arguments to reach the answer.

While more pragmatic and more interpretable, both modular and hybrid networks are tightly coupled with this additional supervision. For instance, the hybrid models cannot learn without it, and while NMN is the first to enable learning from the QA pair alone, it still needs finer-grained supervision for at least a part of the training data. With this, it manages to supersede the SoTA models NABERT and MTMSN on a carefully chosen subset of DROP, using that supervision. However, NMN generalizes poorly to more open-ended settings where such supervision is not easy to handcraft.

Need for symbolic reasoning. One striking characteristic of the modular methods is to avoid discrete reasoning by employing only learnable modules with an exhaustively precomputed space of outputs. While they perform well on DROP, their modeling complexity grows arbitrarily with more complex non-linear numerical operations (e.g., $\exp$, $\log$, $\cos$). Contrarily, symbolic modular networks that execute the discrete operations are possibly more robust and pragmatic in this respect, remaining unaffected by the operation complexity. Such discrete reasoning has indeed been incorporated for simpler, well-structured tasks like math word problems (Koncel-Kedziorski et al., 2016) or KB/Table-QA (Zhong et al., 2017; Liang et al., 2018; Saha et al., 2019), with Deep Reinforcement Learning (RL) for end-to-end training. MRC, however, needs a more generalized framework of modular neural networks involving fuzzier reasoning over noisy entities extracted from open-ended passages.

In view of this, we propose a Weakly-Supervised Neuro-Symbolic Module Network (WNSMN)

  • A first attempt at numerical reasoning based MRC, trained with answers as the sole supervision;

  • Based on a generalized framework of dependency parsing of queries into noisy heuristic programs;

  • End-to-end training of neuro-symbolic reasoning modules in an RL framework with discrete rewards.

To concretely compare WNSMN with contemporary NMN, consider the example in Figure 1. In contrast to our generalized query parsing, NMN parses the query into the program form MAX(FILTER(FIND(‘Carpenter’), ‘goal’)), which is executed step-wise by different learnable modules with an exhaustively precomputed output set. To train the network, it employs various forms of strong supervision, such as gold program operations and gold query-span attention at each step of the program, and gold execution, i.e., supervision of the passage numbers (23, 26, 42) on which to execute the MAX operation.

While NMN can only handle the 6 reasoning categories that its supervision was tailored to, WNSMN targets the full numerical-answer subset of DROP (called DROP-num), which involves more diverse reasoning over more open-ended questions. We empirically compare WNSMN on DROP-num with the SoTA NMN and GenBERT, the baselines that allow learning with partial or no strong supervision. Our results show that the proposed WNSMN achieves 32% better accuracy than NMN in the absence of at least one type of supervision, and performs 8% better than GenBERT when the latter is fine-tuned only on DROP in a comparable setup, without the additional synthetic data that carries explicit supervision.

2 Model: Weakly Supervised Neuro-Symbolic Module Network 

We now describe our proposed WNSMN that learns to infer the answer based on weak supervision of the QA pair by generating the program form of the query and executing it through explicit reasoning.

Parsing Query into Programs

To keep the framework generic, we use a simplified representation of the Stanford dependency parse tree (Chen & Manning, 2014) of the query to get a generalized program (Section A.5). First, a node is constructed for the subtree rooted at each child of the root, by merging its descendants in the original word order. Next, an edge is added from the left-most node (which we call the root clause) to every other node. Then, traversing left to right, each node is organized into a step of a program having a linear flow. For example, the program obtained in Figure 1 is X1 = (‘which is the longest’); X2 = (‘goal by Carpenter’, X1); Answer = Discrete-Reasoning(‘which is the longest’, X2). Each program step consists of two types of arguments: (i) a Query Span Argument, obtained from the corresponding node, which indicates the query segment referred to in that program step (e.g., ‘goal by Carpenter’ in Step 2); (ii) Reference Argument(s), obtained from the incoming edges to that node, which refer to the previous steps of the program that the current one depends on (e.g., X1 in Step 2). Next, a final step of the program is added, which has as reference argument the leaf node(s) obtained in the above manner and as query span argument the root clause. This step is specifically responsible for handling the discrete operation, enabled by the root clause, which is often indicative of the kind of discrete reasoning involved (e.g., max). However, this being a noisy heuristic, the QA model needs to be robust to such noise and additionally rely on the full query representation in order to predict the discrete operation. For simplicity, we limit the number of reference arguments to 2.
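To make the heuristic concrete, below is a minimal sketch of this parsing procedure, using spaCy's dependency parser as a convenient stand-in for the Stanford parser cited above; the function and field names are illustrative, not the authors' implementation.

```python
# A minimal sketch of the query-to-program heuristic (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")

def query_to_program(query, max_refs=2):
    doc = nlp(query)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    # One node per child subtree of the root, descendants merged in word order.
    nodes = [" ".join(t.text for t in sorted(child.subtree, key=lambda t: t.i))
             for child in root.children]
    if not nodes:                                 # degenerate parse: one-node program
        nodes = [doc.text]
    # Left-most node is the root clause (step 1); each remaining node becomes
    # a later step that references it.
    program = [{"span": nodes[0], "refs": []}]
    for node in nodes[1:]:
        program.append({"span": node, "refs": [1]})
    # Final discrete-reasoning step: leaf step(s) as references, root clause as span.
    leaves = list(range(2, len(program) + 1)) or [1]
    program.append({"span": nodes[0], "refs": leaves[:max_refs], "discrete": True})
    return program

# e.g., for the Figure 1 query this yields steps roughly like
#   X1 = ('which is the longest'); X2 = ('goal by Carpenter', X1);
#   Answer = Discrete-Reasoning('which is the longest', X2)
```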

2.1 Program Execution

Our proposed WNSMN learns to execute the program over the passage in three steps. In the preprocessing step, it identifies numbers and dates from the passage, and maintains them as separate canonicalized entity-lists along with their mention locations. Next, it learns an entity-specific cross-attention model to rank the entities w.r.t. their query-relevance (Section 2.1.1), and then samples relevant entities as discrete arguments (Section 2.1.2) and executes appropriate discrete operations on them to reach the answer. An RL framework (Section 2.1.3) trains it end-to-end with the answer as the sole supervision.

2.1.1 Entity-Specific Cross Attention for Information Extraction

To rank the query-relevant passage entities, we model the passage, program and entities jointly.

Modeling interaction between program and passage

This module (Figure 2, left) learns to associate query span arguments of the program with the passage. For this, similar to NMN, we use a BERT-base pretrained encoder (Devlin et al., 2018) to get contextualized token embeddings of the passage and the query span argument of each program step, denoted by $\bm{P}_k$ and $\bm{Q}_k$ respectively for the $k$'th program step. Based on these, we learn a similarity tensor $\bm{S} \in \mathbb{R}^{l \times n \times m}$ between the program and passage, where $l$, $n$, and $m$ are respectively the program length, the query span argument length, and the passage length (in tokens). Each $\bm{S}_k \in \mathbb{R}^{n \times m}$ represents the affinity over the passage tokens for the $k$'th program argument and is defined as $\bm{S}_k(i,j) = \bm{w}^T[\bm{Q}_{ki}; \bm{P}_{kj}; \bm{Q}_{ki} \odot \bm{P}_{kj}]$, where $\bm{w}$ is a learnable parameter and $\odot$ is element-wise multiplication. From this, an attention map $\bm{A}_k$ over the passage tokens is computed for the $k$'th program argument as $\bm{A}_k(i,j) = \mathrm{softmax}_j(\bm{S}_k(i,j)) = \frac{\exp(\bm{S}_k(i,j))}{\sum_j \exp(\bm{S}_k(i,j))}$. Similarly, for the $i$'th token of the $k$'th program argument, the cumulative attention $a_{ki}$ w.r.t. the passage is $a_{ki} = \mathrm{softmax}_i(\sum_j \bm{S}_k(i,j))$. A linear combination of the attention maps $\bm{A}_k(i,\cdot)$ weighted by $a_{ki}$ gives the expected passage attention for the $k$'th step: $\bar{\bm{\alpha}}_k = \sum_i a_{ki} \bm{A}_k(i,\cdot) \in \mathbb{R}^m$.
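A toy PyTorch sketch of this interaction for one program step, assuming $\bm{Q}_k$ and $\bm{P}_k$ have already been produced by the shared BERT encoder (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

d = 768                                         # BERT-base hidden size
w = torch.nn.Parameter(torch.randn(3 * d))      # learnable scoring vector w

def expected_passage_attention(Q_k, P_k):
    """Q_k: (n, d) span-argument embeddings; P_k: (m, d) passage embeddings."""
    n, m = Q_k.size(0), P_k.size(0)
    Qe = Q_k.unsqueeze(1).expand(n, m, d)
    Pe = P_k.unsqueeze(0).expand(n, m, d)
    feats = torch.cat([Qe, Pe, Qe * Pe], dim=-1)   # [Q; P; Q ⊙ P], shape (n, m, 3d)
    S_k = feats @ w                                # similarity matrix S_k, (n, m)
    A_k = F.softmax(S_k, dim=1)                    # attention over passage tokens
    a_k = F.softmax(S_k.sum(dim=1), dim=0)         # cumulative attention over span tokens
    return a_k @ A_k                               # expected passage attention, (m,)
```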

Figure 2: Modeling the interaction between the passage and (left) the program, and (right) its number/date entities. For each program step $k$, they respectively yield (i) stacked span prediction logits and (ii) attention over number/date entities for each passage token. The linear combination of these two gives the expected distribution over entities, $\mathcal{T}^{num}_k$ and $\mathcal{T}^{date}_k$, for step $k$.

Span-level smoothed attention. To facilitate information spotting and extraction over contiguous spans of text, we regularize the passage attention so that the attention on a passage token is high if the attention over its neighbors is so. We achieve this by adopting a heuristic smoothing technique (Huang et al., 2020): taking sliding windows of different lengths $\omega \in \{1, 2, \ldots, 10\}$ over the passage and replacing the token-level attention with the attention averaged over the window. This results in 10 different attention maps over the passage for the $k$'th step of the program: $\{\bar{\bm{\alpha}}^{\omega}_k \,|\, \omega \in \{1, 2, \ldots, 10\}\}$.
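A small sketch of the smoothing step, assuming `alpha` holds the expected passage attention $\bar{\bm{\alpha}}_k$; a uniform 1-D convolution is one simple way to realize the window averaging:

```python
import torch
import torch.nn.functional as F

def smoothed_attention_maps(alpha, max_window=10):
    """alpha: (m,) expected passage attention; returns the 10 smoothed maps."""
    maps = []
    for w in range(1, max_window + 1):
        kernel = torch.full((1, 1, w), 1.0 / w)                 # uniform averaging kernel
        padded = F.pad(alpha.view(1, 1, -1), (w // 2, (w - 1) // 2))
        maps.append(F.conv1d(padded, kernel).view(-1))          # (m,) per window size
    return maps                                                 # {ᾱ_k^ω | ω ∈ {1..10}}
```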

Soft span prediction. This network takes a multi-scaled (Gupta et al., 2020) version of $\bar{\bm{\alpha}}^{\omega}_k$, multiplying the attention map with $|\bm{s}|$ different scaling factors ($\bm{s} = \{1, 2, 5, 10\}$), yielding a $|\bm{s}|$-dimensional representation for each passage token, i.e., $\bar{\bm{\alpha}}^{\omega}_k \in \mathbb{R}^{m \times |\bm{s}|}$. This is then passed through an $L$-layered stacked self-attention transformer block (Vaswani et al., 2017), which encodes it to $m \times d$ dimensions, followed by a linear layer of dimension $d \times 1$, to obtain the span prediction logits: $\bm{\alpha}^{\omega}_k = \mathrm{Linear}(\mathrm{Transformer}(\mathrm{MultiScaling}(\bar{\bm{\alpha}}^{\omega}_k))) \in \mathbb{R}^m$. Further, the span prediction logits at each program step $k$ are additively combined with those from the previous steps referenced by the current one through the reference argument $ref(k)$, i.e., $\bm{\alpha}^{\omega}_k = \bm{\alpha}^{\omega}_k + \sum_{k' \in ref(k)} \bm{\alpha}^{\omega}_{k'}$.
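An illustrative stub of this head; the initial linear lift from $|\bm{s}|$ to $d$ dimensions is an assumption of this sketch (the text does not pin down that detail):

```python
import torch
import torch.nn as nn

SCALES = torch.tensor([1.0, 2.0, 5.0, 10.0])                 # s = {1, 2, 5, 10}

class SpanPredictor(nn.Module):
    def __init__(self, d=64, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(len(SCALES), d)                # lift (m, |s|) to (m, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d, 1)                           # d x 1 span-logit layer

    def forward(self, alpha, ref_logits=()):
        """alpha: (m,) smoothed attention for one window size ω."""
        x = alpha.unsqueeze(-1) * SCALES                     # multi-scaling, (m, |s|)
        h = self.encoder(self.proj(x).unsqueeze(0))          # (1, m, d)
        logits = self.out(h).view(-1)                        # span prediction logits, (m,)
        for r in ref_logits:                                 # additive combination with
            logits = logits + r                              # referenced steps' logits
        return logits
```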

Modeling interaction between program and number/date entities

This module (Figure 2, right) facilitates an entity-based information spotting capability: given a passage mention of a number/date entity relevant to the query, the model should be able to attend to the neighborhood around it. To do this, for each program step, we first compute a passage-tokens-to-number-tokens attention map $\bm{A}^{num} \in \mathbb{R}^{l \times m \times N}$, where $N$ is the number of unique number entities. Note that this attention map is different for each program step, as the contextual BERT encoding of the passage tokens ($\bm{P}_k$) is coupled with the program's span argument of that step. At the $k$'th step, the row $\bm{A}^{num}_k(i,\cdot)$ denotes the probability distribution over the $N$ unique number tokens w.r.t. the $i$'th passage token. The attention maps are obtained by a softmax normalization of each row of the corresponding passage-tokens-to-number-tokens similarity matrix $\bm{S}^{num}_k \in \mathbb{R}^{m \times N}$, for $k = \{1 \ldots l\}$, where the elements of $\bm{S}^{num}_k$ are computed as $\bm{S}^{num}_k(i,j) = \bm{P}_{ki}^T \bm{W}_n \bm{P}_{k n_j}$, with $\bm{W}_n \in \mathbb{R}^{d \times d}$ a learnable projection matrix and $n_j$ the passage location of the $j$'th number token. These similarity scores are additively aggregated over all mentions of the same number entity in the passage.

The relation between program and entities is then modeled as $\bm{\tau}^{\omega}_k = \mathrm{softmax}(\sum_i \alpha^{\omega}_{ki} \bm{A}^{num}_k(i,\cdot)) \in \mathbb{R}^N$, which gives the expected distribution over the $N$ number tokens for the $k$'th program step, using $\omega$ as the smoothing window size. The final stacked attention map obtained for the different windows is $\mathcal{T}^{num}_k = \{\bm{\tau}^{\omega}_k \,|\, \omega \in \{1, 2, \ldots, 10\}\}$. Similarly, for each program step $k$, we also compute a separate stacked attention map $\mathcal{T}^{date}_k$ over the unique date tokens, parameterized by a different $\bm{W}_d$.
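A sketch of the number-token attention and the resulting entity distribution $\bm{\tau}^{\omega}_k$; mention bookkeeping is reduced here to a list of passage positions and entity ids (illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

d = 768
W_n = torch.nn.Parameter(torch.randn(d, d))     # learnable projection for numbers

def entity_distribution(P_k, mention_pos, mention_ent, N, alpha_kw):
    """P_k: (m, d) passage embeddings; mention_pos[j]: passage position of the
    j-th number mention; mention_ent[j]: its entity id in [0, N);
    alpha_kw: (m,) span-prediction attention for one window size ω."""
    m = P_k.size(0)
    sims = P_k @ W_n @ P_k[mention_pos].T                   # (m, #mentions)
    S_num = torch.zeros(m, N)
    S_num.index_add_(1, torch.tensor(mention_ent), sims)    # aggregate mentions per entity
    A_num = F.softmax(S_num, dim=1)                         # per-token entity distribution
    return F.softmax(alpha_kw @ A_num, dim=0)               # τ_k^ω ∈ R^N
```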

A critical requirement for a meaningful attention over entities is to incorporate information extraction capability into the number and date attention maps $\bm{A}^{num}$ and $\bm{A}^{date}$, by enabling the model to attend over the neighborhood of the relevant entity mentions. This is achieved by minimizing the unsupervised auxiliary losses $\mathcal{L}^{num}_{aux}$ and $\mathcal{L}^{date}_{aux}$ in the training objective, which impose an inductive bias over the number and date entities, similar to Gupta et al. (2020). Their purpose is to ensure that the passage attention is densely distributed inside a neighborhood of $\pm\Omega$ (a hyperparameter, e.g., 10) around the passage location of the entity mention, without imposing any bias on the attention distribution outside the neighborhood. Consequently, the loss maximizes the log of the cumulative likelihood of the attention distribution inside the window and the entropy of the attention distribution outside of it.

$$\mathcal{L}^{num}_{aux} = -\frac{1}{l}\sum_{k=1}^{l}\Bigg[\sum_{i=1}^{m}\bigg[\log\Big(\sum_{j=1}^{N}\mathbb{1}_{n_j \in [i \pm \Omega]}\, a^{num}_{kij}\Big) - \sum_{j=1}^{N}\mathbb{1}_{n_j \notin [i \pm \Omega]}\, a^{num}_{kij}\log\big(a^{num}_{kij}\big)\bigg]\Bigg] \qquad (1)$$

where $\mathbb{1}$ is the indicator function and $a^{num}_{kij} = \bm{A}^{num}_k(i,j)$; $\mathcal{L}^{date}_{aux}$ for date entities is defined analogously.
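For concreteness, a sketch of Equation 1 for a single program step, dropping the $1/l$ average over steps and assuming one mention per entity for brevity:

```python
import torch

def aux_loss_step(A_num_k, mention_pos, omega=10, eps=1e-12):
    """A_num_k: (m, N) token-to-entity attention; mention_pos[j]: passage
    position n_j of entity j."""
    m, N = A_num_k.shape
    i = torch.arange(m).unsqueeze(1)                    # (m, 1) token positions
    n = torch.tensor(mention_pos).unsqueeze(0)          # (1, N) mention positions
    inside = (n - i).abs() <= omega                     # indicator 1_{n_j ∈ [i ± Ω]}
    log_lik = ((A_num_k * inside).sum(dim=1) + eps).log()              # mass inside window
    entropy = -(A_num_k * ~inside * (A_num_k + eps).log()).sum(dim=1)  # entropy outside
    return -(log_lik + entropy).sum()                   # negated, summed over tokens i
```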

2.1.2 Modeling Discrete Reasoning

The model next learns to execute a single step¹ of discrete reasoning (Figure 3) based on the final program step. The final step contains (i) the root clause of the query, which often indicates the type of discrete operation (e.g., ‘what is the longest’ indicates max; ‘how many goals’ indicates count), and (ii) the reference argument indicating the previous program steps that the final step depends on. Each previous step $k$ is represented by the stacked attention maps $\mathcal{T}^{num}_k$ and $\mathcal{T}^{date}_k$ obtained in Section 2.1.1.

¹ This is a reasonable assumption for DROP, with a recall of 90% on the training set. It does not limit the generalizability of WNSMN, as standard beam search would allow scaling to an $l$-step MDP.

Operator Sampling Network

Owing to the noisy nature of the program, the operator network takes as input (i) BERT's [CLS] representation of the passage-query pair, along with LSTM (Hochreiter & Schmidhuber, 1997) encodings (randomly initialized) of the BERT contextual representations of (ii) the root clause of the final program step and (iii) the full query (w.r.t. the passage), to make two predictions:

  • Entity-Type Predictor Network: an Exponential Linear Unit (ELU) activated fully-connected layer followed by a $\mathrm{softmax}$ that outputs the probabilities of sampling either the date or number entity type.

  • Operator Predictor Network: a similar ELU-activated fully-connected layer followed by a $\mathrm{softmax}$, which learns a probability distribution over a fixed catalog of 6 numerical and logical operations (count, max, min, sum, diff, negate), each represented with learnable embeddings.

Apart from the diff operator, which acts on exactly two arguments, all other operations can take an arbitrary number of arguments. Also, some of these operations can be applied only to numbers (e.g., sum, negate), while others can be applied to both numbers and dates (e.g., max, count), as cataloged in the sketch below.
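For illustration, the operator catalog and its discrete execution semantics can be summarized as follows; the dictionary layout is an assumption of this sketch, while the arities and type restrictions follow the text, and the percentage-complement reading of negate follows the worked examples in Table 4:

```python
OPERATORS = {
    # name:    (arity,       admissible entity types)
    "count":   ("arbitrary", {"number", "date"}),
    "max":     ("arbitrary", {"number", "date"}),
    "min":     ("arbitrary", {"number", "date"}),
    "sum":     ("arbitrary", {"number"}),
    "diff":    (2,           {"number", "date"}),
    "negate":  ("arbitrary", {"number"}),
}

def execute(op, args):
    """Execute a sampled discrete operation on canonicalized entity values."""
    if op == "count":
        return len(args)
    if op == "max":
        return max(args)
    if op == "min":
        return min(args)
    if op == "sum":
        return sum(args)
    if op == "diff":
        a, b = args
        return abs(a - b)
    if op == "negate":
        return 100 - args[0]   # percentage complement, as in DROP (e.g., 100 - 80.7)
    raise ValueError(f"unknown operator: {op}")
```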

Figure 3: Operator & Argument Sampling Network and RL framework over sampled discrete actions
Argument Sampling Network

This network learns to sample date/number entities as arguments for the sampled discrete operation, given the entity-specific stacked attentions ($\mathcal{T}^{num}_k$ and $\mathcal{T}^{date}_k$) for each previous step $k$ that appears in the reference argument of the final program step. In order to allow sampling a fixed or arbitrary number of arguments, the argument sampler learns four types of networks, each modeled with an $L$-layered stacked self-attention based $\mathrm{Transformer}$ block (with output dimension $d$) followed by different non-linear layers embodying their functionality and a $\mathrm{softmax}$ normalization to get the corresponding argument sampling probability (Figure 3).

  • Sample $n \in \{1, 2\}$ Argument Module: $\mathrm{softmax}(\mathrm{Elu}(\mathrm{Linear}_{d \times n}(\mathrm{Transformer}(\mathcal{T}))))$, outputs a distribution over single entities ($n = 1$) or a joint distribution over entity-pairs ($n = 2$).

  • Counter Module: $\mathrm{softmax}(\mathrm{Elu}(\mathrm{Linear}_{d \times 10}(\mathrm{CNN\text{-}Encoder}(\mathrm{Transformer}(\mathcal{T})))))$, predicts a distribution over possible count values ($\in [1, \ldots, 10]$) of the number of entity arguments to sample.

  • Entity-Ranker Module: $\mathrm{softmax}(\mathrm{PRelu}(\mathrm{Linear}_{d \times 1}(\mathrm{Transformer}(\mathcal{T}))))$, learns to rerank the entities and outputs a distribution over all entities, given the stacked attention maps as input.

  • Sample Arbitrary Argument: $\mathrm{Multinomial}$(Entity-Ranker distribution, Counter prediction).

Depending on the number of arguments needed by the discrete operation and the number of reference arguments in the final program step, the model invokes one of Sample {1, 2, Arbitrary} Argument. For instance, if the sampled operator is diff, which needs 2 arguments, and the final step has 1 or 2 reference arguments, then the model respectively invokes either Sample 2 Argument or Sample 1 Argument on the stacked attention $\mathcal{T}$ corresponding to each reference argument. For operations needing an arbitrary number of arguments, the model invokes Sample Arbitrary Argument: it first predicts the number of entities $c \in \{1, \ldots, 10\}$ to sample using the Counter network, and then samples from the multinomial distribution based on the joint of $c$-combinations of entities constructed from the output distribution of the Entity-Ranker module, as sketched below.
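A rough sketch of that arbitrary-argument path; drawing without replacement is used here as a simple approximation of the multinomial over $c$-combinations:

```python
import torch

def sample_arbitrary_arguments(ranker_probs, counter_probs):
    """ranker_probs: (N,) Entity-Ranker distribution over entities;
    counter_probs: (10,) Counter distribution over counts 1..10."""
    c = torch.multinomial(counter_probs, 1).item() + 1     # sampled count c ∈ {1..10}
    c = min(c, ranker_probs.size(0))                       # cannot exceed #entities
    idx = torch.multinomial(ranker_probs, c, replacement=False)
    # joint log-probability of the sampled argument set under the policy
    log_prob = ranker_probs[idx].log().sum() + counter_probs[c - 1].log()
    return idx.tolist(), log_prob
```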

2.1.3 Training with Weak Supervision in the Deep RL Framework

We use an RL framework to train the model with only the discrete binary feedback from the exact match of the gold and predicted numerical answers. In particular, we use the REINFORCE (Williams, 1992) policy gradient method, where a stochastic policy comprising a sequence of actions is learned with the goal of maximizing the expected reward. In our case, the discrete operation along with the argument sampling constitutes the action. However, because of our assumption that a single step of discrete reasoning suffices for most questions in DROP, we further simplify the RL framework to a contextual multi-armed bandit (MAB) problem with a 1-step MDP, i.e., the agent performs only a single-step action.

Despite the simplifying assumption of a 1-step MDP, the following characteristics of the problem render it highly challenging: (i) the action space $\mathcal{A}$ is exponential in the number of operations and argument entities in the passage (averaging 12K actions for DROP-num); (ii) the extreme reward sparsity owing to the binary feedback is further exacerbated by the presence of spurious rewards, as the same answer can be generated by multiple diverse actions. Note that previous approaches like NMN can avoid such spurious supervision because they heuristically obtain additional annotation of the question category, the gold program, or the gold program execution, at least for some training instances.

In our contextual MAB framework, for an input $x = (\text{passage } p, \text{query } q)$, the context or environment state $s_\phi(x)$ is modeled by the entity-specific cross attention (Section 2.1.1, parameterized by $\phi$) between (i) the passage, (ii) the program form of the query, and (iii) the extracted passage date/number entities. Given the state $s_\phi(x)$, the layout policy (Section 2.1.2, parameterized by $\theta$) then learns the query-specific inference layout, i.e., the discrete action sampling policy $P_\theta(a \,|\, s_\phi(x))$ for actions $a \in \mathcal{A}$. The action sampling probability is a product of the probability of sampling the entity type ($P_\theta^{type}$), the probability of sampling the operator ($P_\theta^{op}$), and the probability of sampling the entity argument(s) ($P_\theta^{arg}$), normalized by the number of arguments to sample. Therefore, with the learnable context representation $s_\phi(x)$ of input $x$, the end-to-end objective is to jointly learn $\{\theta, \phi\}$ so as to maximize the expected reward $R(x,a) \in \{-1, +1\}$ over the sampled actions $a$, based on exact match with the gold answer.

To mitigate the learning instability in such sparse, confounding reward settings, we initialize with a simpler iterative hard-Expectation Maximization (EM) learning objective called Iterative Maximal Likelihood (IML) (Liang et al., 2017). Under the assumption that the sampled actions are extensive enough to contain the gold answer, IML greedily searches for good actions by fixing the policy parameters and then maximizes the likelihood of the best action that led to the highest reward. We define good actions ($\mathcal{A}^{good}$) as those that result in the gold answer itself, and take a conservative approach of defining the best among them as simply the most likely one according to the current policy.

$$J^{IML}(\theta, \phi) = \sum_x \max_{a \in \mathcal{A}^{good}} \log P_{\theta,\phi}(a|x) \qquad (2)$$
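A minimal sketch of this objective for one instance, treating the $K$ sampled actions as a stand-in for the search over $\mathcal{A}^{good}$:

```python
import torch

def iml_loss(action_log_probs, answers, gold):
    """action_log_probs: (K,) log P(a|x); answers: K predicted answers."""
    good = [i for i, ans in enumerate(answers) if ans == gold]      # A^good
    if not good:                              # no sampled action hit the gold answer
        return torch.tensor(0.0)
    best = max(good, key=lambda i: action_log_probs[i].item())      # most likely good action
    return -action_log_probs[best]            # minimize its negative log-likelihood
```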

After a few epochs of the IML initialization, we switch to REINFORCE as the learning objective, where the goal is to maximize the expected reward $J^{RL}(\theta,\phi) = \sum_x \mathbb{E}_{P_{\theta,\phi}(a|x)}\, R(x,a)$, with gradient

$$\nabla_{(\theta,\phi)} J^{RL} = \sum_x \sum_{a \in \mathcal{A}} P_{\theta,\phi}(a|x)\,\big(R(x,a) - B(x)\big)\,\nabla_{\theta,\phi}\log P_{\theta,\phi}(a|x) \qquad (3)$$

where $B(x)$ is simply the average (baseline) reward obtained by the policy for that instance $x$. Further, in order to mitigate overfitting, in addition to $L_2$-regularization and dropout, we also add entropy-based regularization over the argument sampling distribution in each of the sampling networks.
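Putting Equation 3 and the regularizers together, a sketch of the per-instance loss over $K$ sampled actions might look as follows (the entropy term and its weight are illustrative):

```python
import torch

def reinforce_loss(action_log_probs, rewards, sampling_dists, ent_weight=1e-2):
    """action_log_probs: (K,) log P(a|x) for K sampled actions; rewards: (K,)
    R(x,a) ∈ {-1,+1}; sampling_dists: probability vectors (entity-type,
    operator, argument) to regularize."""
    baseline = rewards.mean()                              # B(x): average reward
    advantage = (rewards - baseline).detach()
    pg_loss = -(advantage * action_log_probs).mean()       # policy-gradient surrogate
    entropy = sum(-(p * (p + 1e-12).log()).sum() for p in sampling_dists)
    return pg_loss - ent_weight * entropy                  # entropy bonus aids exploration
```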

3 Experiments

We now empirically compare the exact-match performance of WNSMN with SoTA baselines on versions of the DROP dataset, and also examine how it fares in comparison to strongly supervised skylines.
The Primary Baselines for WNSMN are the explicit-reasoning-based NMN (Gupta et al., 2020), which uses additional strong supervision, and the BERT-based language model GenBERT (Geva et al., 2020), which does not embody any reasoning and autoregressively generates the numeric answer tokens. As the Primary Dataset we use DROP-num, the subset of DROP with numerical answers. This subset contains 45K and 5.8K instances respectively from the standard DROP train and development sets. Originally, NMN was showcased on a very specific subset of DROP, restricted to the 6 reasoning types it could handle, of which three (count, date-difference, extract-number) have numeric answers. This subset comprises 20K training and 1.8K development instances, of which only 10K and 800 instances respectively have numerical answers. We further evaluate on this numerical subset, referred to as DROP-Pruned-num. In both cases, the training data was randomly split 70%:30% into train and internal validation sets, and the standard DROP development set was treated as the test set.

Figure 4: t-SNE plot of DROP-num-Test questions.

Figure 4 shows the t-SNE plot of pretrained Sentence-BERT (Reimers & Gurevych, 2019) encodings of all questions in DROP-num-Test, with the DROP-Pruned-num-Test subset marked in different colors (red, green, yellow) representing its reasoning types. Not only are the DROP-num questions more diverse than the carefully chosen DROP-Pruned-num subset, but the latter also forms well-separated clusters corresponding to the three reasoning types. Additionally, the average perplexity (computed with nltk) of the DROP-Pruned-num and DROP-num questions was found to be 3.9 and 10.65 respectively, further indicating the comparatively open-ended nature of the latter.

For the primary baselines NMN and GenBERT, we report the performance of in-house trained models on the respective datasets, using the code open-sourced by the authors. The remaining results, taken from Geva et al. (2020), Kinley & Lin (2019), and Ran et al. (2019), refer to models trained on the full DROP dataset. All models use the same pretrained BERT-base. Also note that a primary requirement of all models other than GenBERT and WNSMN (i.e., NMN, MTMSN, NABERT, NAQANet, NumNet) is the exhaustive enumeration of the output space of all possible discrete operations. This simplifies the QA task to a classification setting, thus alleviating the need for discrete reasoning in the inference process.

Table 1: DROP-num-Test performance of baselines and WNSMN. The NMN-num variants retain different subsets of the Program (Prog.), Execution (Exec.), and Query-Attention (QAtt.) supervision types.

Model               Acc. (%)
NMN-num variants    11.77 / 17.52 / 18.27 / 18.54 / 12.27 / 11.80 / 11.70
GenBERT             42.30
GenBERT-num         38.41
WNSMN               50.97

Table 1 presents our primary results on DROP-num, comparing the performance of WNSMN (accuracy of the top-1 sampled action by the RL agent) with various ablations of NMN (provided in the authors' implementation), obtained by removing at least one of the Program, Execution, and Query-Attention supervision types (Section A.4.1), and with GenBERT models with pretrained BERT that are finetuned on DROP or DROP-num (denoted GenBERT and GenBERT-num). For a fair comparison with our weakly supervised model, we do not treat NMN with all forms of supervision, or GenBERT pretrained with additional synthetic numerical and textual data, as comparable baselines. Note that these GenBERT variants indeed enjoy strong reasoning supervision in terms of the gold arithmetic expressions provided in those auxiliary datasets.

NMN's performance is abysmally poor, a drastic degradation compared to its performance on the pruned DROP subset reported by Gupta et al. (2020) and in our subsequent experiments in Table 2. This can be attributed to its limitations in handling the more diverse classes of reasoning and open-ended queries in DROP-num, further exacerbated by the lack of one or more types of strong supervision.² Our earlier analysis of the complexity of the questions in the pruned subset and the full DROP-num further quantifies the relative difficulty of the latter. On the other hand, GenBERT delivers a mediocre performance, while GenBERT-num degrades by an additional 4%, as learning from numerical answers alone further curbs its language modeling ability. Our model performs significantly better than both of these baselines, surpassing GenBERT by 8% and the NMN baselines by around 32%. This showcases the significance of incorporating explicit reasoning in neural models in comparison to vanilla large-scale LMs like GenBERT. It also establishes the generalizability of such reasoning-based models to more open-ended forms of QA, in comparison to contemporary modular networks like NMN, owing to WNSMN's ability to handle both learnable and discrete modules in an end-to-end manner.

² Both the results and limitations of NMN in Tables 1 and 2 were confirmed by the authors of NMN as well.

Next, in Table 2, we compare the performance of the proposed WNSMN with the same NMN variants (as in Table 1) on DROP-Pruned-num. Some of the salient observations are:

Table 2: DROP-Pruned-num-Test performance of NMN variants and WNSMN, overall (Acc.) and per reasoning type. The NMN-num variants retain different subsets of the Program (Prog.), Execution (Exec.), and Query-Attention (QAtt.) supervision; the first row uses all three.

Model              Acc. (%)   Count   Extract-num   Date-differ
NMN-num variants
                   68.6       50.0    88.4          72.5
                   42.4       24.1    73.9          36.4
                   54.3       47.9    80.7          40.9
                   63.3       45.5    81.1          68.7
                   48.2       38.1    72.4          41.9
                   61.0       44.7    81.1          63.2
                   62.3       43.7    84.1          67.7
                   62.1       46.8    83.6          66.1
WNSMN              66.5       58.8    66.8          75.2

(i) WNSMN in fact reaches a performance quite close to the strongly supervised NMN variant (first row), and attains a margin of at least 4% over all other variants obtained by removing one or more types of supervision. This is despite all NMN variants additionally enjoying the exhaustive precomputation of the output space of possible numerical answers. (ii) WNSMN suffers only on extract-number type operations (e.g., max, min), which involve the more complex process of sampling an arbitrary number of arguments. (iii) The performance drop of NMN is not very large between all and none of the strong supervision types, possibly because of the limited diversity of reasoning types and query language. (iv) Query-Attention supervision in fact adversely affects NMN's performance in the absence of program supervision, execution supervision, or both, possibly owing to an undesirable biasing effect; however, when both supervisions are available, query attention improves model performance by 5%. Further, we believe the test set of 800 instances is too small to give an unbiased reflection of the models' performance.

In Table 3, we additionally inspect the recall over the top-$k$ actions sampled by WNSMN to estimate how it fares in comparison to the strongly supervised skylines: (i) NMN with all forms of strong supervision; (ii) the GenBERT variants +ND, +TD, and +ND+TD, further pretrained on synthetic Numerical Data, Textual Data, and both; (iii) reasoning-free hybrid models like MTMSN (Hu et al., 2019), NumNet (Ran et al., 2019), NAQANet (Dua et al., 2019), NABERT, and NABERT+ (Kinley & Lin, 2019). Note that neither NumNet nor NAQANet uses pretrained BERT. MTMSN achieves SoTA performance through a supervised framework that trains specialized predictors for each reasoning type to predict the numerical expression directly, instead of learning to reason. While the top-1 performance of WNSMN (in Table 1) is 4% worse than NABERT, Recall@top-2 is equivalent to the strongly supervised NMN, top-5 and top-10 are comparable to NABERT+, NumNet, and the GenBERT +ND and +TD models, and top-20 nearly achieves SoTA. Such promising recall over the top-$k$ actions suggests that more sophisticated RL algorithms with better exploration strategies could bridge this performance gap.

Table 3: Skylines and WNSMN top-$k$ performance on DROP-num-Test
Strongly Supervised Models Acc. (%)
NMN-num (all supervision) 58.10
GenBERT+ND 69.20
GenBERT+TD 70.50
GenBERT+ND+TD 75.20
NAQANet 44.97
NABERT 54.27
NABERT+ 66.60
NumNet 69.74
MTMSN 75.00
Recall@top-$k$ actions of WNSMN (%)
k=2: 58.6   k=3: 63.0   k=4: 65.4   k=5: 67.4   k=10: 72.3   k=20: 74.2

4 Analysis & Future Work

Performance Analysis

Despite the notorious instability of RL due to high variance, the training trend, as shown in Figure 5(a), is not afflicted by catastrophic forgetting. The sudden performance jump between epochs 10-15 is due to switching from the iterative ML initialization to the REINFORCE objective. Figure 5(b) shows the individual module-wise performance evaluated using noisy pseudo-rewards, which indicate whether the action sampled by a module led to the correct answer or not (details in Section A.6). Further, by bucketing performance by the total number of passage entities in Figure 5(c), we observe that WNSMN remains unaffected by the increasing number of dates/numbers, despite the action-space explosion. On the other hand, GenBERT's performance drops linearly beyond 25 passage entities, and NMN-num degrades exponentially from the beginning, owing to its direct dependency on the exponentially growing, exhaustively precomputed output space.

Module                  Performance
Sample 1 Argument       54% (Acc.)
Sample 2 Argument       52% (Acc.)
Counter                 50% (Acc.)
Entity Ranker           53% (Acc.)
Operator Predictor      78% (Acc.)
Entity Type Predictor   83% (Acc.)
Overall Action Sampler  84% (Rec@All)
Figure 5: (a) Training trend showing Recall@top-$k$ and all actions, and the accuracy of the Operator and Entity-Type Predictors, estimated with noisy pseudo-rewards (Section A.6); (b) module-wise performance (using pseudo-rewards) on DROP-num-Test; (c) performance bucketed by total number of passage entities, for WNSMN and the best-performing NMN and GenBERT models from Table 1.
More Stable RL Framework

The training trend in Figure 5(a) shows early saturation, and the module-wise performance indicates overfitting despite the regularization tricks in Section 2.1.3 and Section A.6. While more stable RL algorithms like Actor-Critic, Trust Region Policy Optimization (Schulman et al., 2015), or Memory Augmented Policy Optimization (Liang et al., 2018) could mitigate these issues, we leave them for future exploration. Also, though this work's objective was to train module networks with weak supervision, the sparse, confounding rewards in the exponential action space indeed render the RL training quite challenging. One practical future direction for bridging the performance gap would be to pretrain with strong supervision on at least a subset of reasoning categories, or on more constrained forms of synthetic questions, similar to GenBERT. Such a setting would require inspecting and evaluating the generalizability of the RL model to unknown reasoning types and more open-ended questions.

5 Related Work

In this section we briefly compare our proposed WNSMN to the two closest genres of models that have proven quite successful on DROP:³ (i) reasoning-free hybrid models (NumNet, NAQANet, NABERT, NABERT+, MTMSN, and NeRd), and (ii) modular networks for reasoning (NMN). Their main distinction from WNSMN is that, in order to address the challenges of weak supervision, they obtain program annotations from the QA pairs through (i) various heuristic parsings of the templatized queries in DROP to get supervision of the reasoning type (max/min, diff/sum, count, negate), and (ii) exhaustive search over all possible discrete operations to get supervision of the arguments in the reasoning.

³ A more detailed related work section is presented in Appendix A.4.

Such heuristic supervision makes the learning problem significantly simpler in the following ways:

  • These models enjoy supervision of specialized programs that carry explicit information about the type of reasoning to apply for a question, e.g., SUM(10, 12)

  • A simplistic (contextual, BERT-like) reader model reads query-related information from the passage, trained with direct supervision of the query span arguments at each step of the program

  • A programmer model can be directly trained to decode the specialized programs

  • Numerical functions (e.g., difference, count, max, min) are executed either by (i) training purely neural modules in a strongly supervised setting using the annotated programs, or by (ii) performing the actual discrete operation as a post-processing step on the model's predicted program. For each of these previous works, it is possible to directly apply the learning objective on the space of decoded programs, without having to deal with the discrete answer or any non-differentiability.

However, such heuristic techniques of program annotation or exhaustive search are not practical as the language of the questions or the space of discrete operations becomes more complex. Hence WNSMN learns in the challenging weakly supervised setting, without any additional annotation, through:

  • A noisy symbolic query decomposition that is oblivious to the reasoning type and simply based on generic text parsing techniques

  • An entity specific cross attention model extracting passage information relevant to each step of the decomposed query and learning an attention distribution over the entities of each type

  • Learning to apply discrete reasoning by employing neural modules that learn to sample the operation and the entity arguments

  • Leveraging a combination of neural and discrete modules when executing the discrete operation, instead of using only neural modules which need strong supervision of the programs for learning the functionality

  • A fundamentally different learning strategy that incorporates inductive bias through auxiliary losses and uses Iterative Maximal Likelihood for a more conservative initialization, followed by REINFORCE

These reasoning-free hybrid models are not comparable with WNSMN because of their inability to learn in the absence of any heuristic program annotation. Instead of learning to reason based only on the final answer supervision, they reduce the task to learning to decode the program, based on heuristic program annotation. NMN is the only reasoning-based model that, similar to us, employs various auxiliary losses to learn even in the absence of such additional supervision.

To our knowledge, WNSMN is the first work on modular networks for fuzzy reasoning over text in the RC framework to handle the challenging cold-start problem of the weakly supervised setting without needing any additional specialized supervision of heuristic programs.

6 Conclusion

In this work, we presented the Weakly Supervised Neuro-Symbolic Module Network for numerical reasoning based MRC, built on a generalized framework of query parsing into noisy heuristic programs. It trains both neural and discrete reasoning modules end-to-end in a Deep RL framework with only a discrete reward based on exact answer match. Our empirical analysis on the numerical-answer-only subset of DROP shows significant performance improvement of the proposed model over SoTA NMNs and the Transformer-based language model GenBERT, when trained in comparable weakly supervised settings. While, to our knowledge, this is the first effort towards training modular networks for fuzzy reasoning over RC in a weakly supervised setting, there is significant scope for improvement, such as employing a more sophisticated RL framework or leveraging the pretraining of reasoning.

References

  • Chen & Manning (2014) Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  740–750, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1082. URL https://www.aclweb.org/anthology/D14-1082.
  • Chen et al. (2020) Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryxjnREFwH.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL http://arxiv.org/abs/1810.04805.
  • Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proc. of NAACL, 2019.
  • Geva et al. (2020) Mor Geva, Ankit Gupta, and Jonathan Berant. Injecting numerical reasoning skills into language models. In ACL, 2020.
  • Gupta et al. (2020) Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. Neural module networks for reasoning over text. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SygWvAVFPr.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2019) Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. A multi-type multi-span network for reading comprehension that requires discrete reasoning. In Proceedings of EMNLP, 2019.
  • Huang et al. (2020) Ting Huang, Zhi-Hong Deng, Gehui Shen, and Xi Chen. A window-based self-attention approach for sentence encoding. Neurocomputing, 375:25–31, 2020. doi: 10.1016/j.neucom.2019.09.024. URL https://doi.org/10.1016/j.neucom.2019.09.024.
  • Kinley & Lin (2019) Jambay Kinley and Raymond Lin. Nabert+: Improving numerical reasoning in reading comprehension. 2019. URL https://github.com/raylin1000/drop-bert.
  • Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://www.aclweb.org/anthology/N16-1136.
  • Liang et al. (2017) Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 23–33, 2017.
  • Liang et al. (2018) Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V Le, and Ni Lao. Memory augmented policy optimization for program synthesis and semantic parsing. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp.  10015–10027. Curran Associates, Inc., 2018.
  • Ran et al. (2019) Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2474–2484, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1251. URL https://www.aclweb.org/anthology/D19-1251.
  • Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/abs/1908.10084.
  • Saha et al. (2019) Amrita Saha, Ghulam Ahmed Ansari, Abhishek Laddha, Karthik Sankaranarayanan, and Soumen Chakrabarti. Complex program induction for querying knowledge bases in the absence of gold programs. Transactions of the Association for Computational Linguistics, 7:185–200, March 2019. doi: 10.1162/tacl_a_00262. URL https://www.aclweb.org/anthology/Q19-1012.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. volume 37 of Proceedings of Machine Learning Research, pp. 1889–1897, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/schulman15.html.
  • Subramanian et al. (2020) Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, and Matt Gardner. Obtaining Faithful Interpretations from Compositional Neural Networks. In Association for Computational Linguistics (ACL), 2020.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
  • Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning, 2017.

Appendix A Appendix

A.1 Qualitative Analysis

Weakly Supervised Neuro-Symbolic Module Network GenBERT
1.  Query: how many times did a game between the patriots versus colts result in the exact same scores?, Ans: 2
    Num. of Passage Entities: Date(10), Number(9)
D, N = Entity-Attention(‘how many times’) // D, N are the attention distribution over date and number entities Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘did a game between the patriots versus colts result in the exact same scores’, (D, N)) Decoder output: 2
‘Number’, ‘Count’ = EntType-Operator-Selector(‘how many times’, Query) Span extracted: “colts”
Answer 2 = Count(N1) Answer = 2
2.  Query: how many people in chennai, in terms of percent population, are not hindu?, Ans: 19.3
    Num. of Passage Entities: Date(2), Number(26)
D, N = Entity-Attention(‘how many people in chennai, in terms of percent population’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘are not hindu’, (D, N)) Decoder output: 19.3
‘Number’, ‘Negate’ = EntType-Operator-Selector(’are not hindu’, Query) Span extracted: “80.7”
1 = Count(N) Answer = 19.3
{80.7} = Sample-Arbitrary-Arguments(N1, 1)
Answer = 19.3 = Negate({80.7})
3.  Query: how many more percent of the population was male than female?, Ans: 0.4
    Num. of Passage Entities: Date(4), Number(29)
D, N = Entity-Attention(‘how many’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘more percent of the population was male’, (D, N)) Decoder output: 3.2
D2, N2 = Entity-Attention(‘than female’, (D, N)) Span extracted: “49.8”
‘Number’,‘Difference’ = EntType-Operator-Selector(’how many’, Query) Answer = 3.2
50.2 = Sample-1-Argument(N1)
49.8 = Sample-2-Argument(N2)
Answer = 0.4 = Difference({50.2, 49.8})
4.  Query: how many more, in percent population of aigle were between 0 and 9 years old than are 90 and older?, Ans: 9.8
    Num. of Passage Entities: Date(0), Number(25)
D, N = Entity-Attention(‘how many more’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘in percent population of aigle were between 0 and 9 years old’, (D, N)) Decoder output: 1.7
D2, N2 = Entity-Attention(‘than are 90 and older’, (D, N)) Span extracted: “0.9”
‘Number’, ‘Difference’ = EntType-Operator-Selector(‘how many more’, Query) Answer = 1.7
10.7 = Sample-1-Argument(N1)
0.9 = Sample-1-Argument(N2)
Answer = 9.8 = Difference({10.7, 0.9})
5.  Query: going into the 1994 playoffs, how many years had it been since the suns had last reached the playoffs?, Ans: 3
    Num. of Passage Entities: Date(3), Number(17)
D, N = Entity-Attention(‘going into the 1994 playoffs : how many years’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘had it been since the suns had last reached the playoffs’, (D, N)) Decoder output: 7
‘Date’, ‘Difference’ = EntType-Operator-Selector(‘going into the 1994 playoffs : how many years’, Query) Span extracted:“1991”
{1991, 1994} = Sample-2-Argument(D) Answer = 7
Answer = 3 = Difference({1991, 1994})
6.  Query: how many more points did the cats have in the fifth game of the AA championship playoffs compared to st. paul saints?,
    Ans: 3
    Num. of Passage Entities: Date(3), Number(12)
D, N = Entity-Attention(‘how many’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘more points did the cats have in the fifth game of the AA championship playoffs’, (D, N)) Decoder output: 3
D2, N2 = Entity-Attention(‘compared to the st. paul saints’, (D, N)) Span extracted: “4 - 1 in the fifth game”
‘Number’, ‘Difference’ = EntType-Operator-Selector(‘how many’, Query) Answer = 3
5.0 = Sample-1-Argument(N1)
2.0 = Sample-1-Argument(N2)
Answer = 3.0 = Difference({5.0, 2.0})
7.  Query: how many total troops were there in the battle?, Ans: 40000
    Num. of Passage Entities: Date(1), Number(3)
D, N = Entity-Attention(‘how many total troops’) Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘were there in the battle’, (D, N)) Decoder output: 100000
‘Number’, ‘Sum’ = EntType-Operator-Selector(’how many total troops’, Query)
2 = Count(N1) Span extracted: “10000 korean troops”
{10000.0, 30000.0} = Sample-Arbitrary-Arguments(N1, 2) Answer = 100000
Answer = 40000.0 = Sum({10000.0, 30000.0})
Weakly Supervised Neuro-Symbolic Module Network NMN-num GenBERT
8.  Query: how many field goals did sebastian janikowski and kris brown both score each? Ans: 2
    Num. of Passage Entities: Date(0), Number(9)
D, N = Entity-Attention(‘how many field goals’) P1 = Find-Passage-Attention() Predicted AnsType: Decoded
D1, N1 = Entity-Attention(‘did sebastian janikowski and kris brown both score each’, (D, N)) P2 = Filter-Passage-Attention(P1) Decoder output: 2
‘Number’, ‘Count’ = EntType-Operator-Selector(‘how many field goals’, Query) 2 = Passage-Attn-To-Count(P2) Span extracted: “33 - yard”
Answer = 2.0 = Count(N1) Answer = 2 Answer = 2
9.  Query: how many years was between the oil crisis and the energy crisis? Ans: 6
    Num. of Passage Entities: Date(19), Number(14)
D1, N1 = Entity-Attention(‘was between the oil crisis and the energy crisis’) year-diffs $\in \mathbb{R}^{40}$ // generated exhaustive output space of all differences Predicted AnsType: Decoded
D, N = Entity-Attention(‘how many years’, D1, N1) P1 = Find-Passage-Attention() Decoder output: 3
‘Date’, ‘Difference’ = EntType-Operator-Selector(‘how many years’, Query) 6 = Year-Difference(P1, year-diffs) Span extracted: “1973”
{1973, 1979} = Sample-2-Argument(D) Answer = 6.0 Answer = 3
Answer = 6.0 = Difference({1973, 1979})
10.  Query: how many yards was the longest touchdown pass? Ans: 40
    Num. of Passage Entities: Date(0), Number(5)
D, N = Entity-Attention(‘how many yards was the’) P1 = Find-Passage-Attention() Predicted AnsType: Extract-Span
D1, N1 = Entity-Attention(‘longest touchdown pass’, (D, N)) N1 = Find-Passage-Number(P1) Decoder output: 43
‘Number’, ‘Sum’ = EntType-Operator-Selector(‘how many yards was the’, Query) 40 = Find-Max-Num(N1) Span extracted: “40”
1 = Count(N) Answer = 40 Answer = 40
{40.0} = Sample-Arbitrary-Argument(N, 1)
Answer = 40.0 = Sum({40.0})
Table 4: Example questions from DROP-num along with predictions of the proposed model WNSMN and the best performing versions of the NMN-num and GenBERT baselines from Table 1. Detailed elaboration of the outputs of these three models:
(i) WNSMN first parses the dependency structure of the query into a program form. Next, for each step of the program, it generates an attention distribution over the date and number entities; Entity-Attention refers to the learnt entity-specific cross attention described in Section 2.1.1. It then performs the discrete reasoning by sampling an operation and specific entity arguments in order to reach the answer. EntType-Operator-Selector refers to the Entity-Type and Operator Predictors in the Operator Sampling Network, and Sample-*-Argument refers to the Argument Sampling Network described in Section 2.1.2. Sum, Difference, and Negate are some of the discrete operations executed to get the answer. In some cases (e.g., Query 3), despite wrong parsing, the model was able to predict the correct operation even though the root clause did not have sufficient information. In Query 10, the correct operation is Max, but WNSMN reaches the right answer by sampling only the maximum number entity through the Sample-Arbitrary-Argument network and then applying a spurious Sum operation on it.
(ii) On the other hand, the steps of the program generated by NMN-num first compute or further filter attention distributions over the passage or entities, which are then fed into the learnable modules (Passage-Attn-To-Count, Year-Difference) that predict the answer. In order to do so, it needs to precompute all possible outputs of numerical operations that generate new numbers, e.g., year-diffs in Example 9. Because of the relatively poorer performance of NMN-num, its outputs are only reported for the last 3 instances, which were cherry-picked based on NMN-num's predictions.
(iii) GenBERT first predicts whether the answer should be decoded or extracted from a passage span, and accordingly uses the decoder output or the extracted span as the answer. By design, the modular networks provide a more interpretable output than the monolithic encoder-decoder model GenBERT.

A.2 Implementation & Pseudo-Code

The source code and models pertaining to this work will be open-sourced upon acceptance. A detailed pseudo-code of the WNSMN algorithm is provided below.

Algorithm 1 WNSMN Algorithm
  Input: (Query (q), Passage (p)) = x
  Output (or Supervision): Answer (y) ∈ ℝ
  Preprocessing:
    [num_1, num_2, …, num_N] = Num = Extract-Numbers(p)   // number entity mentions in passage
    [date_1, date_2, …, date_D] = Date = Extract-Dates(p)  // date entity mentions in passage
  Inference:
    [(q_1, ref_1), …, (q_k, ref_k), …, (q_l, ref_l)] = Program = Query-Parsing(q)
    for step (q_k, ref_k) ∈ Program do
        (A^num_k, T^num_k), (A^date_k, T^date_k) = Entity-Attention(q_k, p, ref_k, Num, Date)   // Section 2.1.1
    end for
    L^num_aux, L^date_aux = Entity-Inductive-Bias(A^num, A^date)   // Equation 1
    L_aux = L^num_aux + L^date_aux
    q_l = query-span argument of the last step       // program arguments and stacked attention
    ref_l = reference argument of the last step      // map over entities for the last step
    T^num = {T^num_k | k ∈ ref_l},  T^date = {T^date_k | k ∈ ref_l}
    Operators = {op_1, op_2, …, op_k1} = Operator-Predictor(q_l, q)          // operator and
    EntTypes = {type_1, type_2, …, type_k1} = Entity-Type-Predictor(q_l, q)  // entity-type sampling
    Actions = {}   // action sampling for each operator
    for op, type ∈ (Operators, EntTypes) do
        if type is Number then
            T = T^num
        else if type is Date then
            T = T^date
        end if
        if op is diff then
            if |ref_l| == 2 then
                arg1 = {arg1_1, arg1_2, …, arg1_k2} = Sample-1-Argument(T_0)
                arg2 = {arg2_1, arg2_2, …, arg2_k2} = Sample-1-Argument(T_1)
                args = {(a1, a2) | (a1, a2) ∈ (arg1, arg2)}
            else if |ref_l| == 1 then
                args = {arg_1, arg_2, …, arg_k2} = Sample-2-Argument(T_0)
            end if
        else if op is count then
            args = {count_1, count_2, …, count_k2} = Count-Network(Σ_j T_j)
        else
            args = {arg_1, arg_2, …, arg_k2} = Sample-Arbitrary-Argument(Σ_j T_j)
        end if
        probs = {(p^type · p^op · p) | p ∈ p^arg} ∈ ℝ^k2   // p’s refer to the corresponding probabilities
        answers = {Execute-Discrete-Operation(type, op, arg) | arg ∈ args} ∈ ℝ^k2
        actions = {(prob, answer) | prob ∈ probs, answer ∈ answers}
        Actions = Actions ∪ actions
    end for
  Training:
    for i ∈ {1, …, N_IML + N_RL} do
        for (x, y) ∈ D do
            A(x) ← Actions sampled for input x   // using the inference procedure above
            R(x, a, y) ← exact-match reward for action a on instance x with gold answer y
            if i ≤ N_IML then
                (θ, ϕ) ← max_{θ,ϕ} J^IML over (A, R) + min_ϕ L_aux   // J^IML from Equation 2
            else
                (θ, ϕ) ← max_{θ,ϕ} J^RL over (A, R) + min_ϕ L_aux    // J^RL from Equation 3
            end if
        end for
    end for
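To make the inference phase concrete, the following minimal Python sketch shows how sampled operator, entity-type, and argument probabilities can be combined into scored discrete actions, mirroring the action-construction loop of Algorithm 1. All names (execute, enumerate_actions) and the toy probabilities are illustrative stand-ins, not the actual implementation.

  def execute(op, args):
      """Symbolic execution of a sampled discrete operation (subset of ops)."""
      if op == "diff":
          return abs(args[0] - args[1])
      if op == "sum":
          return sum(args)
      if op == "max":
          return max(args)
      if op == "min":
          return min(args)
      if op == "negate":
          return 100 - args[0]
      raise ValueError(f"unhandled operator: {op}")

  def enumerate_actions(op_samples, type_samples, arg_samples):
      """op_samples:   list of (op, p_op) from the Operator Sampling Network
      type_samples: list of (ent_type, p_type) from the Entity-Type Predictor
      arg_samples:  dict op -> list of (args, p_arg) from the Argument Sampler
      Returns scored (probability, answer) actions, as in Algorithm 1."""
      actions = []
      for (op, p_op), (ent_type, p_type) in zip(op_samples, type_samples):
          for args, p_arg in arg_samples.get(op, []):
              prob = p_type * p_op * p_arg   # joint action probability
              actions.append((prob, execute(op, args)))
      return actions

  # Toy run loosely following Example 9 of Table 4:
  actions = enumerate_actions(
      op_samples=[("diff", 0.7), ("sum", 0.2)],
      type_samples=[("Date", 0.8), ("Number", 0.2)],
      arg_samples={"diff": [((1973, 1979), 0.9)], "sum": [((1973, 1979), 0.4)]},
  )
  print(max(actions))   # -> (0.504, 6), i.e., Difference({1973, 1979})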

A.3 Qualitative Inspection of WNSMN Predictions

Good Action: action resulting in an exact match with the gold answer
Correct Action: action manually annotated to be correct
Number of test instances (DROP-num Test): 5800
Number of instances with at least 1 good action: 4868
Number of instances with more than 1 good action: 2533
Average number of good actions (where there is at least 1 good action): 1.5
Average number of good actions (where there is more than 1 good action): 2.25
Number of instances where the top-1 action is a good action: 2956
Number of instances where top-1 is the only good action: 2335 (79% of 2956)
Number of instances where the top-1 action could be spuriously good: 620 (21% of 2956)
Number of instances manually annotated (out of the possible cases of a spurious top-1 action): 334 (out of 620)
Number of instances where the top-1 action was found to be spurious: 28 (8.4% of 334)
Average ratio of the probability of the top action to the maximum probability of any other spuriously good action (if any): 4.4e+11
Table 5: Analysis of the predictions of WNSMN on DROP-num Test

Generic Observations/Notes

  • Note: When the Argument Sampling network selects a single number and the sampled operator is not of type count, we treat the operation as a NO-OP; for example, sum/min/max over a single number or date is treated as NO-OP (see the sketch after these notes).

  • One potential source of spuriously correct answers is the neural ‘counter’ module, which can predict numbers in [1, 10]. However, over the cases where at least one of the top-50 actions is a good action, we observe that the model learns when the answer is directly present as an entity or can be obtained through (non-count) operations over other entities, and when it cannot be obtained directly from the passage but requires aggregating (i.e., counting) over multiple entities. Table 8 below gives some examples of hard instances where the WNSMN top-1 prediction was found to be correct.
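The NO-OP rule from the first note above can be stated in a few lines; the sketch below uses hypothetical names and covers only a few operators.

  def apply_operation(op, args):
      if len(args) == 1 and op != "count":
          return args[0]   # NO-OP: return the single selected entity itself
      if op == "sum":
          return sum(args)
      if op == "max":
          return max(args)
      if op == "min":
          return min(args)
      raise ValueError(f"unhandled operator: {op}")

  # As in Query 10 of Table 4: Sum over the single sampled entity {40.0} reduces to 40.0.
  assert apply_operation("sum", [40.0]) == 40.0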

True Reasoning | Model Prediction | Count
negate a passage entity, i.e., 100 − number | the model selected negate of the correct entity as the top action | 34
min/max of a set of passage entities | the model instead directly sampled the correct minimum/maximum entity as a single argument and then applied a NO-OP operation over it | 11
select one of the passage entities | the model selected the right entity and applied NO-OP on it as the top action | 18
count over passage entities | the model ranked count as the top action, and the spurious actions came much lower with almost epsilon probability | 88
difference over passage entities (the same answer could be spuriously obtained by other non-difference operations over unrelated entities) | the model ranked difference as the top action, and the spurious actions came much lower with almost epsilon probability | 89
difference over passage entities (the same answer could be spuriously obtained by a difference over other unrelated entities) | the model ranked difference over the correct arguments as the top action | 66
Table 6: Case study of the 306 instances manually annotated as Correct, out of the 334 inspected
True Reasoning | Model Prediction | Count
difference of dates/months | count over years | 4
sum(number1, count([number2])) | count over numbers | 1
difference between entities | sum over two arguments (both arguments wrong) | 1
difference between entities | difference over two arguments (both arguments wrong) | 1
difference between entities | count over entities | 1
difference between entities | sum over arguments (one correct); the correct action was taken in one of the other top-5 beams | 2
question is vague/incomplete/could not be answered manually | count or difference | 2
counting over text spans (very rare question type; only 2 found out of 334) | wrong operator | 2
miscellaneous | wrong operator | 7
miscellaneous | correct operator, wrong arguments (one correct) | 2
miscellaneous | correct operator, wrong arguments (all wrong) | 5
Table 7: Case study of the 28 instances manually annotated as Wrong, out of the 334 inspected.
Question | Relevant Passage Excerpt | Model Prediction Analysis
How many printing had Green Mansions gone through by 1919? | “W. H. Hudson which went through nine printings by 1919 and sold over 20,000 copies…” | The model ranked the operation sum([9.0]) highest. The count-number operator had near-epsilon probability, indicating that it indeed found no evidence of the answer 9 being obtainable by counting entities over the passage, despite the fact that most “how many” type questions need counting.
The Steelers finished their 1995 season having lost how many games difference to the number of games they had won? | “In 1995, the Steelers overcame a 3-4 start (including a 20-16 upset loss to the expansion 1995 Jacksonville Jaguars season) to win eight of their final nine games and finished with an record, the second-best in the AFC.” | The model had to avoid the distracting numbers (3, 4) and (20, 16) to understand that the correct operation is the difference 9 − 8.
How many more field goals did Longwell boot over Kasay? | “26-yard field goal by kicker Ryan Longwell … Carolina got a field goal with opposing kicker John Kasay. … Vikings would respond with another Longwell field goal (a 22-yard FG) … Longwell booted the game-winning 19-yard field goal” | The question needed counting of certain events, none of which appeared as numbers. The model applied count over number entities correctly.
How many delegates were women from both the Bolshevik delegates and the Socialist Revolutionary delegates? | “Of these mandatory candidates, only one Bolshevik and seven Socialist Revolutionary delegates were women.” | The model applied sum on the correct numbers, even though many “how many” type questions need counting.
How many years in a row did the GDP growth fall into negatives? | “Growth dropped to 0.3% in 1997, -2.3% in 1998, and -0.5% in 1999.” | The model had to understand which numbers are negative, and that it should count the two events instead of taking a difference of the years.
At it’s lowest average surface temperature in February, how many degrees C warmer is it in May? | “The average surface water temperature is 26-28 C in February and 29 C in May.” | The passage had distracting unrelated numbers in proximity, but the model selected the lowest temperature out of (26, 28) and then took the difference 29 − 26.
How many years before the blockade was the Uqair conference taken place? | “Ibn Saud imposed a trade blockade against Kuwait for 14 years from 1923 until 1937… At the Uqair conference in 1922, …” | The passage had other distracting unrelated numbers in proximity, but the model selected the correct difference operation.
Table 8: Manual analysis of a few hard instances (with question and relevant passage excerpt) where the WNSMN top-1 prediction was found to be correct

A.4 Background: Numerical Reasoning over Text

The most generic form of Numerical reasoning over text (NRoT) is arguably encompassed by the machine reading comprehension (MRC) framework (as in Dua et al. (2019)), where, given a long passage context c, the model needs to answer a query q, which can involve generating a numerical or textual answer or selecting a numerical quantity or a span of text from the passage or query. The distinguishing factor from general RC is the need to perform numerical computation over the entities and numbers in the passage to reach the goal.

Discrete/symbolic reasoning in NRoT: In the early NRoT datasets (Hosseini et al., 2014; Roy & Roth, 2015; Koncel-Kedziorski et al., 2016), which deal with simpler math word problems having a small context and few number entities, symbolic techniques for applying discrete operations were quite popular. However, as the space of operations grows, or as the question or context becomes more open-ended, these techniques fail to generalize. Incorporating explicit reasoning in neural models as discrete operations requires handling non-differentiable components in the network, which leads to optimization challenges.

Discrete reasoning using RL: Recently, Deep Reinforcement Learning (DRL) has been employed in various neuro-symbolic models to handle discrete reasoning, but mostly in simpler tasks like KBQA, Table-QA, or Text-to-SQL (Zhong et al., 2017; Liang et al., 2017; 2018; Saha et al., 2019; ijcai2019-679; Neelakantan et al., 2017). Such tasks can be handled by well-defined components or modules with well-structured function prototypes (i.e., function arguments are of specific variable types, e.g., KB entities or relations, or Table row/column/cell values), which can be executed entirely as a symbolic process. MRC, on the other hand, needs more generalized frameworks of modular networks involving fuzzy forms of reasoning, which can be achieved by learning to execute the query over a sequence of learnable neural modules, as explored in Gupta et al. (2020). This was inspired by Neural Module Networks, which have proved quite promising for tasks requiring similar fuzzy reasoning, like Visual QA (Andreas et al., 2015; 2016).

SoTA models on DROP:

While the current leaderboard-topping models already showcase quite superior performance on the reasoning-based RC task, closer inspection is needed to understand whether the problem has indeed been fully solved.

Pre-trained Language Models: On one hand, the large-scale pretrained language models (Geva et al., 2020) use a Transformer encoder-decoder (with pretrained BERT) to emulate the input-output behavior, decoding digit-by-digit for numeric answers and token-by-token for span-based answers. However, such models perform poorly when trained only on DROP and need additional synthetic datasets of numeric expressions and DROP-like numeric textual problems, each augmented with the gold numeric expression form.

Reasoning-free Hybrid Models: On the other hand, a class of hybrid neural models has also gained SoTA status on DROP by explicitly handling the different types of numerical computation in the standard extractive QA pipeline. Most models in this genre, like NumNet (Ran et al., 2019), NAQANet (Dua et al., 2019), NABERT+ (Kinley & Lin, 2019), MTMSN (Hu et al., 2019) and NeRd (Chen et al., 2020), do not actually treat it as a reasoning task; instead they precompute an exhaustive enumeration of all possible outcomes of numerical and logical operations (e.g., sum/diff, negate, count, max/min) and augment the training data with knowledge of the query type (depending on the reasoning type) and all the numerical expressions that lead to the correct answer. This reduces the question-answering task to simply learning a multi-type answer predictor that classifies the reasoning type and directly predicts the numerical expression, thus alleviating the need for rationalizing the inference or handling any (non-differentiable) discrete operation in the optimization. Some of the initial models in this genre are NAQANet (Dua et al., 2019) and NumNet (Ran et al., 2019), which are numerically aware enhancements of QANet (Yu et al., 2018) and of graph neural networks, respectively. These were followed by the BERT-based models NABERT and NABERT+ (Kinley & Lin, 2019), i.e., BERT versions of the former, enhanced with standard numbers and expression templates for constraining numerical expressions. MTMSN (Hu et al., 2019) models a specialized multi-type answer predictor designed to support specific answer types (e.g., count/negation/add/sub) with supervision, for each type, of the arithmetic expressions that lead to the gold answer.

Modular Networks for Reasoning: NMN (Gupta et al., 2020) is the first model to address the QA task through explicit reasoning, by learning to execute the query as a specialized program over learnable modules tailored to handle different types of numerical and logical operations. However, to do so, it needs to further augment the training data with annotations of the gold program and the gold program execution, i.e., the exact discrete operation and numerical expression (the numerical operation and operands) that lead to the correct answer; e.g., the supervision of the gold numerical expression in Figure 1 is SUM(23, 26, 42). This is usually obtained through manual inspection of the data via regex-based pattern matching and heuristics applied on the query language. Because of the abundance of templatized queries in DROP, this pattern matching is in fact quite effective and noise-free, so the resulting annotations act as strong supervision.

However, such a manually intensive process severely limits the overall model from scaling to more general settings. This is especially true for some of the previous reasoning-based models, NABERT+, NumNet and MTMSN, which perform better than NMN (in fact achieving SoTA performance) on the full DROP dataset. We do not consider them as our primary baselines because, unlike NMN, these models (Hu et al., 2019; Efrat et al., 2019; Dua et al., 2019; Ran et al., 2019) have no provision to learn in the absence of the additional supervision generated through exhaustive enumeration and manual inspection. Gupta et al. (2020) were the first to train a modular network with strong, albeit more fine-grained, supervision for a fraction of the training data, together with auxiliary losses that allow learning from the QA pairs alone. Consequently, on a carefully chosen subset of DROP, NMN showcased better performance than NABERT and MTMSN when strong supervision is available for only part of the training data.

Our work takes this direction further in two ways:

  • While the NMN baseline can handle only 6 specific kinds of reasoning, for which its program generation and gold reasoning annotation are tailored, our model works on the full DROP-num, which involves more diverse kinds of reasoning and more open-ended questions, requiring evaluation on a 7.5× larger test subset while training on 4.5× larger training data.

  • While NMN generalizes poorly on the full DROP-num, especially when one or more types of supervision are removed, our model performs significantly better without any of these types of supervision.

Together, NMN and GenBERT are among the latest works in the two popular directions (reasoning-based and language-model-based) for DROP that allow learning with partial or no strong supervision, and hence act as the primary baselines for our model.

Since in this work we investigate how neural models can incorporate explicit reasoning, we focus only on answering questions having numerical answers (DROP-num), where we believe the effect of explicit reasoning is more directly observable. This is backed up by the category-wise performance comparison of the reasoning-free language model GenBERT (reported in Geva et al. (2020)) with the hybrid models (MTMSN and NABERT+) that exploit the numerical computation required to answer DROP questions. While, on DROP-num, there is an accuracy gap of 33% between GenBERT and the hybrid models (when all are trained on DROP only), there is only a 2-3% performance gap on the subset having single-span answers, despite the latter also needing reasoning. This evinces that the performance gap on DROP-num is indeed due to exploiting explicit reasoning under such strongly supervised settings.

A.4.1 Limitations of NMN

The primary motivation behind our work comes from the limitations of the contemporary neural module network NMN and of the reasoning-free hybrid models MTMSN, NABERT+, NumNet and NAQANet; specifically, their dependence on the availability of various kinds of strong supervision. We first describe the nature of the programmatic decompositions of queries used in the modular architecture of NMN, the closest comparable work.

NMN defines a program structure with modules like ‘find’, ‘filter’, ‘relocate’, ‘find-num’, ‘find-date’, ‘year-difference’, ‘max-num’, ‘min-num’, ‘compose-number’, etc., to handle a carefully chosen subset of DROP showcasing only 6 types of reasoning (e.g., Date-Difference, Count, Extract-Number, Number-Compare). For example, for the query Which is the longest goal by Carpenter?, the program structure would be MAX(FILTER(FIND(‘Carpenter’), ‘goal’)), where each of these operations is a learnable network. However, to facilitate learning of such specialized programs and of the networks corresponding to these modules, the model needs precomputation of the exhaustive output space of the different discrete operations, and also various kinds of strong supervision signals pertaining to program generation and execution.

Precomputation of the Exhaustive Output Space: For operations that generate a new number as output (e.g., sum/diff), the annotation enumerates the set of all possible outputs by computing over all subsets of number or date entities in the passage. This simplifies the task by allowing the model to directly optimize the likelihood of the arithmetic expression that leads to the final answer, without any need for handling discrete operations.
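For illustration, here is a minimal sketch of such an enumeration over passage number entities; the function name and the subset-size cap are hypothetical choices for this sketch, not the actual annotation code.

  from itertools import combinations

  def exhaustive_output_space(numbers, max_subset_size=3):
      """All values reachable by summing a subset or differencing a pair."""
      outputs = set()
      for r in range(2, max_subset_size + 1):
          for subset in combinations(numbers, r):
              outputs.add(sum(subset))      # sums over subsets
      for a, b in combinations(numbers, 2):
          outputs.add(abs(a - b))           # pairwise differences
      return outputs

  # With passage numbers 23, 26 and 42 (as in Figure 1), the gold expression
  # SUM(23, 26, 42) = 91 is guaranteed to appear in the enumerated space.
  print(sorted(exhaustive_output_space([23, 26, 42])))
  # [3, 16, 19, 49, 65, 68, 91]

The space grows combinatorially with the number of entities and the allowed expression length, which is why this precomputation is only feasible for restricted operation sets.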

Program Supervision provides the query category, out of the 6 reasoning categories to which their program induction grammar is tailored. With this knowledge, the category-specific grammar can be used directly to induce the program (e.g., SUM(FILTER(FIND)) in Figure 1). Further, all these models (NMN, MTMSN, NABERT+, NumNet, NAQANet) use the query-category supervision, which includes knowledge of the ‘gold’ discrete operation (i.e., count, max/min, or add/sub) to perform.

Query Attention Supervision provides information about the query segment to attend upon in each step of the program, as the program argument; e.g., in Figure 1, ‘Carpenter’ and ‘goal’ in the 1st and 2nd steps of the program.

Execution Supervision: For operations that select one or more of the number/date entities in the passage (e.g., max/min), rule-based techniques provide supervision of the subset of number or date entities over which the operation is to be performed.

These annotations are heuristically generated through manual inspection and regex-based pattern matching of queries, thus limiting their applicability to a small subset of DROP. Furthermore, using a hand-crafted grammar to cater the program generation to each of the reasoning categories hinders generalizability to more open-ended settings. While this kind of annotation is feasible to obtain for DROP, this is clearly not the case for future datasets with more open-ended forms of query, calling for other learning paradigms that do not require such manually intensive annotation effort.
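To give a flavour of such heuristics, below is a small, purely illustrative sketch of regex-based query categorization; the patterns are invented examples, not the actual annotation rules used by NMN.

  import re

  CATEGORY_PATTERNS = [
      (r"how many (years|months|days) .*(between|before|after)", "date-difference"),
      (r"how many .*(field goals|touchdowns|passes)", "count"),
      (r"(longest|shortest|largest|smallest)", "min-max"),
  ]

  def annotate_category(query):
      q = query.lower()
      for pattern, category in CATEGORY_PATTERNS:
          if re.search(pattern, q):
              return category
      return None   # queries outside the templated categories get no program

  print(annotate_category(
      "How many years was between the oil crisis and the energy crisis?"))
  # date-difference

Any query whose phrasing falls outside the hand-written patterns receives no (or a wrong) category, which is precisely the brittleness argued above.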

A.4.2 Pretraining Data for GenBERT

Figure 6: t-SNE of questions in DROP-num-Test and the Synthetic Textual Data used in the GenBERT models (TD and ND+TD)

While GenBERT (Geva et al., 2020) greatly benefits from pretraining on synthetic data, there are a few notable aspects of how the synthetic textual data was carefully designed to be similar to DROP. The textual data was generated for the same two categories, nfl and history, as DROP, with similar vocabulary and involving the same numerical operations over similar ranges of numbers (2-3 digit numbers for DROP and 2-4 digit numbers for the synthetic textual data). The intentional overlap between these two datasets is evident from the t-SNE plots (Figure 6) of the pretrained Sentence-Transformer embeddings of questions from DROP-num (blue) and the Synthetic Textual Data (red). Further, while the generalizability of GenBERT was tested on add/sub operations from the math word problem (MWP) datasets ADD-SUB, SOP and SEQ, its synthetic textual data was also generated using the same structure of world states, entities and verb categories used by Hosseini et al. (2014) to generate these MWP datasets. Such bias mitigates the real challenges of generalization, limiting the true test of robustness of such language models for numerical reasoning.

Figure 7: Examples of Programs for WNSMN obtained from the Dependency Parse Tree of the Query

A.5 Query Parsing: Details

The Stanford dependency parse tree of the query is organized into a program structure as follows:

  • Step 1) A node is constructed out of the subtree rooted at each immediate child of the root; the left-most node is called the root-clause.

  • Step 2) Traversing the nodes from left to right, an edge is added from the left-most node to every other node, and each of these nodes is added as a step of the program, with the node as the query-span argument of that step and the incoming edges from past program steps as the reference argument.

  • Step 3) The terminal (leaf) nodes obtained in this manner are then used to add a final step to the program, which is responsible for handling the discrete operation. The query-span argument of this step is the root-clause, which is often indicative of the kind of discrete reasoning to perform. The reference arguments of this step are the leaf nodes obtained from Step 2).

Figure 7 provides some example queries similar to those in DROP, along with their dependency parse trees, the simplified representation obtained by constructing the nodes and edges as in Steps 1) and 2) above, and the final program used by WNSMN. Note that in this simplified representation of the parse tree, the root word of the original parse tree is absorbed into its immediately succeeding child. We also simplify the structure to limit the number of reference arguments in any step of the program to 2, which in turn requires the number of terminal nodes (after Step 2 of the above process) to be limited to 2. This is done during the left-to-right traversal by collapsing any additional terminal nodes into a single node. A minimal sketch of this construction appears below.
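The sketch operates on a toy, pre-chunked parse (the query spans of the immediate subtrees under the root, left to right) rather than a real Stanford parse tree; all names and the exact reference-propagation rule are illustrative assumptions.

  def parse_to_program(root_children):
      """root_children: query spans under the parse root, left to right;
      the left-most is the root-clause (Step 1)."""
      root_clause, rest = root_children[0], root_children[1:]
      program = []
      # Step 2: each remaining node becomes a program step whose reference
      # arguments point at the preceding steps.
      for i, span in enumerate(rest):
          program.append((span, list(range(i))))
      # Collapse extra terminals so the final step has at most 2 references.
      leaf_refs = list(range(len(program)))[:2]
      # Step 3: the final step carries the discrete operation; its query-span
      # argument is the root-clause.
      program.append((root_clause, leaf_refs))
      return program

  # Toy query: "How many years was between the oil crisis and the energy crisis?"
  children = ["how many years",
              "was between the oil crisis",
              "and the energy crisis"]
  for step in parse_to_program(children):
      print(step)
  # ('was between the oil crisis', [])
  # ('and the energy crisis', [0])
  # ('how many years', [0, 1])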

A.6 RL Framework: Details

In this section we discuss some additional details of the RL framework and the tricks applied in the objective function.
Iterative ML Objective: In the absence of supervision of the true discrete action that leads to the correct answer, this iterative procedure fixes the policy parameters to search for the good actions A^good = {a : R(x, a) = 1} and then optimizes the likelihood of the best one among them. However, the simple, conservative approach of defining the best action as the most likely one according to the current policy can lead to local minima and overfitting issues, especially in our particularly sparse and confounding reward setting. So we take a convex combination of a conservative and a non-conservative selection, which respectively pick the most and the least likely action in A^good according to the current policy. The hyperparameter λ weighs these two parts of the objective and is chosen to be quite low (1e-3), to serve the purpose of an epsilon-greedy exploration strategy without diverging significantly from the current policy.

J^IML(θ, ϕ) = Σ_x [ (1 − λ) max_{a ∈ A^good} log P_{θ,ϕ}(a|x) + λ min_{a ∈ A^good} log P_{θ,ϕ}(a|x) ]
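A short PyTorch-style sketch of this objective for a single instance; the tensor shapes and names are our own illustrative choices, not the actual implementation.

  import torch

  def j_iml(log_probs, rewards, lam=1e-3):
      """log_probs: (num_actions,) log P(a|x) under the current policy.
      rewards: (num_actions,) exact-match rewards in {0, 1}."""
      good = rewards > 0                 # A^good = {a : R(x, a) = 1}
      if not good.any():
          return log_probs.sum() * 0.0   # no good action found for x
      good_lp = log_probs[good]
      # Conservative term: most likely good action; non-conservative: least likely.
      return (1 - lam) * good_lp.max() + lam * good_lp.min()

  logits = torch.randn(50, requires_grad=True)       # scores of top-50 sampled actions
  log_probs = torch.log_softmax(logits, dim=0)
  rewards = (torch.arange(50) % 7 == 0).float()      # toy exact-match rewards
  loss = -j_iml(log_probs, rewards)                  # maximize J^IML = minimize -J^IML
  loss.backward()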

Using Noisy Pseudo-Rewards: In addition to using the REINFORCE objective to maximize the likelihood of actions that lead to the correct answer, we can also obtain different noisy pseudo-rewards (∈ {−1, +1}) for the different modules that contribute to the action sampling (i.e., the operator, entity-type, and the different argument sampler networks). Towards this end, we define the pseudo-reward for sampling an operator as the maximum of the rewards obtained from all the actions involving that operator. Similarly, we can define a reward for predicting the entity type (date or number) over which the discrete operation should be executed. Following the same idea, we also obtain pseudo-rewards for the different argument sampling modules. For example, if the most likely operator (as selected by the Operator Sampler) is of type count and it gets a pseudo-reward of +1, then we can use the rewards obtained by the different possible outputs of the Counter network as noisy pseudo-label supervision, and subsequently add an explicit negative log-likelihood loss for the Counter module to the final objective. A similar pseudo-reward can be designed for the Entity-Ranker module when the most likely operator sampled by the Operator Sampler needs an arbitrary number of arguments. Treating the pseudo-reward as a noisy label leads to a negative log-likelihood loss on the output distribution of the Entity-Ranker, following the idea that the correct entities should at least be ranked high enough to get selected when sampling any arbitrary number of entities.
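A minimal sketch of deriving such per-module pseudo-rewards from action-level exact-match rewards, simplified to the operator and entity-type modules (the names and data layout are hypothetical):

  from collections import defaultdict

  def module_pseudo_rewards(actions):
      """actions: list of dicts with keys 'op', 'ent_type', 'reward' (0/1).
      The pseudo-reward for an operator (or entity type) is +1 if any action
      involving it matched the gold answer, and -1 otherwise."""
      op_best = defaultdict(lambda: -1)
      type_best = defaultdict(lambda: -1)
      for a in actions:
          r = 1 if a["reward"] > 0 else -1
          op_best[a["op"]] = max(op_best[a["op"]], r)
          type_best[a["ent_type"]] = max(type_best[a["ent_type"]], r)
      return dict(op_best), dict(type_best)

  actions = [
      {"op": "diff", "ent_type": "Date", "reward": 1},
      {"op": "count", "ent_type": "Number", "reward": 0},
  ]
  print(module_pseudo_rewards(actions))
  # ({'diff': 1, 'count': -1}, {'Date': 1, 'Number': -1})

These per-module pseudo-rewards can then be plugged in as noisy labels for module-specific negative log-likelihood losses, as described above.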