
Teaching Neural Module Networks to Do Arithmetic

Jiayi Chen Xiao-Yu Guo Yuan-Fang Li Gholamreza Haffari
Abstract

Answering complex questions that require multi-step, multi-type reasoning over raw text is challenging, especially when numerical reasoning is involved. Neural Module Networks (NMNs) follow the programmer-interpreter framework and design trainable modules to learn different reasoning skills. However, NMNs have only limited reasoning abilities and lack numerical reasoning capability. We upgrade NMNs by: (a) bridging the gap between the interpreter and complex questions; (b) introducing addition and subtraction modules that perform numerical reasoning over numbers. On a subset of DROP, experimental results show that our proposed methods enhance NMNs’ numerical reasoning skills by a 17.7% absolute improvement in F1 score and significantly outperform previous state-of-the-art models.

Figure 1: Three examples in the DROP dataset and the predictions by the original NMNs and our improved model NMNs±. The relevant tokens and their corresponding modules are highlighted.

1 Introduction

Complex Question Answering (CQA) over text is a challenging task in Natural Language Understanding (NLU). Based on the programmer-interpreter paradigm, Neural Module Networks (NMNs) Gupta et al. (2020) learn to first parse complex questions into executable programs composed of various predefined trainable modules, and then execute these programs over the given paragraph to predict answers of various types. NMNs achieve competitive reasoning performance on a subset of DROP Dua et al. (2019) and possess remarkable interpretability, which is also important for CQA.

However, NMNs’ numerical reasoning capability is insufficient: they cannot handle arithmetic operations such as addition and subtraction between numbers, which make up nearly 40% of the questions in the DROP dataset. Moreover, a gap exists between the interpreter and the complex question, since there is no interaction between them. Motivated by these observations, we propose two methods to improve NMNs’ numerical reasoning skills.

First, we incorporate the original question into the interpreter, aiming to directly provide question information during the “execution” process, especially for number-related questions. The intuition is that, in the original NMNs, questions participate in the process only through the programmer, which can create a disconnect between what the question asks and what the interpreter returns. For example, in Figure 1, the first row shows that the original NMNs found the wrong event (i.e., ‘besieged Sinj’) based solely on the paragraph information. In contrast, our model NMNs± can easily target the correct event (i.e., ‘Sinj finally fell’) with the help of question information.

Second, we introduce new modules to support addition and subtraction of up to three numbers. Endowing NMNs with arithmetic ability can greatly boost their overall performance on DROP and beyond. For instance, in Figure 1, the second row shows that the original NMNs improperly adopt the find-num module for the addition question because the module set does not cover such arithmetic. To facilitate the learning of the add/sub modules, we extract QA pairs related to addition and subtraction from the original DROP dataset to construct a new dataset for training and evaluation.

Experimental results show that our methods significantly enhance NMNs’ numerical reasoning capability. On a subset of DROP, our methods improve the F1 score by 17.7% absolute points overall, and by 65.7% absolute points on add-sub questions. Our method also outperforms NumNet Ran et al. (2019), which is specifically designed for numerical reasoning, by 2.9% absolute F1 points.

2 Background and Related Work

Semantic Parsing is a widely adopted approach to the complex question answering (CQA) task, which involves a number of reasoning steps. In this approach, a programmer maps natural-language questions into machine-readable representations (logical forms), which are executed by an interpreter to yield the final answer. For instance, WNSMN Saha et al. (2021) uses a generalized framework of dependency parsing, inspired by the Stanford dependency parse tree Chen and Manning (2014), to parse queries into noisy heuristic programs. Neural Module Networks Gupta et al. (2020) extend semantic parsing by making the interpreter a learnable function with specified modules and executing the logical forms from the programmer in a step-wise manner.

Neural Module Networks were initially proposed for the Visual Question Answering (VQA) problem Andreas et al. (2016), where questions are often compositional. Gupta et al. (2020) employ the programmer-interpreter framework with attention Vaswani et al. (2017) to tackle the CQA task. Specifically, the programmer parses each question into an executable program. The interpreter takes the program as input and performs various symbolic reasoning functions. The modules are defined in a differentiable way, aiming to maintain the uncertainty about each intermediate decision and propagate it through the layers. For instance, the predicted program of the first example in Figure 1 is span(compare-date-lt(find, find)). The interpreter first calls the find module twice to find the events queried by the question (e.g., ‘the fall of Sinj’) and outputs the corresponding paragraph attention. The compare-date-lt module then locates the dates (e.g., ‘30 September 1686’) to compute their relation. By demonstrating the intermediate reasoning steps in this manner, NMNs perform interpretable problem-solving.
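To make the step-wise execution concrete, the following is a minimal, hypothetical sketch of how such a nested program could be evaluated; the tree representation, class and function names are our own assumptions, not the released NMNs code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ProgramNode:
    name: str                                   # module name, e.g. "find", "compare-date-lt"
    children: List["ProgramNode"] = field(default_factory=list)

def execute(node: ProgramNode, paragraph, modules: Dict[str, Callable]):
    """Recursively evaluate a program tree bottom-up: each node calls the module
    it names on the paragraph plus the outputs of its children."""
    child_outputs = [execute(c, paragraph, modules) for c in node.children]
    return modules[node.name](paragraph, *child_outputs)

# The predicted program for the first example in Figure 1,
# span(compare-date-lt(find, find)), represented as a tree:
program = ProgramNode("span", [
    ProgramNode("compare-date-lt", [ProgramNode("find"), ProgramNode("find")]),
])
# `modules` would map each name to a differentiable function that consumes and
# produces attention distributions over the paragraph (and, in our model, the question).
```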

Numerical Reasoning is a necessary ability for models to handle the CQA task Geva et al. (2020). Dua et al. (2019) modify the output layer of QANet Yu et al. (2018) and propose a number-aware model, NAQANet, to deal with numerical questions. NumNet Ran et al. (2019) leverages a Graph Neural Network to capture relations between numbers. Similarly, QDGAT Chen et al. (2020a) distinguishes number types more precisely by adding connections to entities and obtains better performance. NeRd Chen et al. (2020b) searches possible programs exhaustively based on answers and employs these programs as weak supervision. Another related work Guo et al. (2021) proposes a question-aware interpreter but uses an entirely different approach to measure the alignment between the question and the context paragraph. Although these approaches achieve high performance on the DROP dataset, their reasoning procedures are not interpretable.

3 Model

In this section, we describe our proposed methods. Section 3.1 presents the incorporation of questions into the interpreter, and Section 3.2 describes the newly added addition and subtraction modules.

3.1 The Incorporation of Questions

Taking the compare-date module as a case study: it performs comparisons between two references queried by the question. A key reasoning step inside is the find-date module, which obtains a date token distribution $D$ related to each reference: $\verb|find-date|(P) \rightarrow D$. It is worth noting that there is no interaction with the question, which could contain essential information (e.g., entities) useful for correctly answering the question. Therefore, we revise the find-date module as follows, $\verb|find-date|(P, Q) \rightarrow D$:

$\mathbf{S}^{date}_{i,d_j} = [\alpha\mathbf{P}; (1-\alpha)\mathbf{Q}]_i\, \mathbf{W}_{date}\, \mathbf{P}_{d_j}$,  (1)
$\mathbf{A}^{date}_{i:} = \mathrm{softmax}(\mathbf{S}^{date}_{i:})$,  (2)
$D = \sum_i [\alpha P; (1-\alpha)Q]_i \cdot \mathbf{A}^{date}_{i:}$  (3)

where $\mathbf{P}$ and $\mathbf{Q}$ represent the contextualized embeddings of the paragraph and the question, $\mathbf{P}_{d_j}$ is the embedding of the $j^{th}$ date token in the paragraph, $\mathbf{W}_{date}$ is a trainable parameter, and $P, Q$ are the expected attention distributions over the paragraph and the question, respectively.

In Equation 1, we concatenate the paragraph embeddings $\mathbf{P}$ and question embeddings $\mathbf{Q}$ output by a pre-trained BERT Devlin et al. (2019) model to construct the context representation. A hyper-parameter $\alpha$ adjusts their contributions, and its value is determined empirically (Appendix A.1). The context representation is used to compute the improved similarity matrix $\mathbf{S}^{date}$. We concatenate the paragraph and question attention inputs in the same way to calculate the final expected distribution over the date tokens $D$ (Eq. 3). The interpreter is now equipped with question information to make the prediction.
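The computation in Eqs. (1)-(3) can be sketched as follows; this is a minimal NumPy illustration, and the function name, argument names, and the choice of NumPy (rather than the framework used in the original implementation) are our own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def find_date(P_emb, Q_emb, p_attn, q_attn, date_idx, W_date, alpha=0.4):
    """Question-aware find-date, a sketch of Eqs. (1)-(3).

    P_emb: (Lp, d)  paragraph token embeddings from BERT
    Q_emb: (Lq, d)  question token embeddings from BERT
    p_attn: (Lp,)   expected paragraph attention
    q_attn: (Lq,)   expected question attention
    date_idx:       indices of the date tokens in the paragraph
    W_date: (d, d)  trainable bilinear parameter
    """
    # Eq. (1): alpha-weighted concatenation of paragraph and question embeddings,
    # bilinear similarity against each date-token embedding.
    ctx_emb = np.concatenate([alpha * P_emb, (1 - alpha) * Q_emb], axis=0)  # (Lp+Lq, d)
    S = ctx_emb @ W_date @ P_emb[date_idx].T                                # (Lp+Lq, n_dates)
    # Eq. (2): row-wise softmax over the date tokens.
    A = softmax(S, axis=-1)
    # Eq. (3): marginalize with the alpha-weighted context attention.
    ctx_attn = np.concatenate([alpha * p_attn, (1 - alpha) * q_attn])       # (Lp+Lq,)
    D = ctx_attn @ A                                                        # (n_dates,)
    return D
```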

3.2 Addition and Subtraction Modules

In the NMNs’ modelling paradigm, an add/sub module takes as input two number distributions and produces an output distribution over all possible result values: $\verb|add/sub|(N_1, N_2) \rightarrow RL$. $N_1$ and $N_2$ represent the probability distributions of the $1^{st}$ and $2^{nd}$ operands over all numbers that are extracted from the paragraph and collected into a sorted operand list $OL$. The positive and negative values of these numbers are exhaustively combined in pairs, from which the possible results of addition/subtraction operations are compiled into a sorted result list $RL$. For each input number distribution $N_i, i = 1, 2$, a matrix $\mathbf{C}_i \in \mathbb{R}^{m \times n}$ is constructed, where $m$ is the total number of possible results and $n$ is the number of candidate operands in $OL$. Each value $\mathbf{C}_i[j,k]$ is set to the probability $N_i[k]$ whenever $OL[k]$ can serve as the $i^{th}$ operand in a pair that produces result $RL[j]$, i.e., the probability that the $k^{th}$ number in the operand list is the correct $i^{th}$ operand for that result. We compute the marginalized joint probability by summing the products of the $\mathbf{C}_i$ entries over all valid pairs, yielding the expected distribution over the result list $RL$. For the addition module, this is:

$p(\mathrm{prediction} = RL[j]) = \sum_{k_1,k_2=1}^{n} \mathbb{1}\big(OL[k_1] + OL[k_2] = RL[j]\big)\, \mathbf{C}_1[j,k_1] \cdot \mathbf{C}_2[j,k_2]$

For instance, assume the sorted operand list $OL$ from a paragraph is [1, 5, 7, 11] and $N_1 = [0.1, 0.4, 0.2, 0.3]$. Different combinations are formed, e.g., ($+n_1$, $+n_2$) for addition and ($+n_1$, $-n_2$) for subtraction, and all possible results of the combinations are compiled into two result lists, one for addition and one for subtraction. For subtraction in this case, $RL = [0, 2, 4, 6, 10]$. The value of $\mathbf{C}_1[2,1]$ is 0.4, taken from $N_1[1]$ because the result 4 can be obtained from (+5, -1); similarly, $\mathbf{C}_1[2,3] = 0.3$, which equals $N_1[3]$, as 4 is also the result of (+11, -7). $\mathbf{C}_2$ is computed in the same way to obtain the final distribution over $RL$.
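The following NumPy sketch walks through this subtraction example; $N_2$ is a made-up uniform distribution, and the pairing and sign conventions are our reading of the description above rather than the released code.

```python
import numpy as np

OL = np.array([1, 5, 7, 11])                 # sorted operands from the paragraph
N1 = np.array([0.1, 0.4, 0.2, 0.3])          # distribution of the 1st operand (from the example)
N2 = np.array([0.25, 0.25, 0.25, 0.25])      # distribution of the 2nd operand (made up)

# All ordered pairs (k1, k2); for subtraction the result is OL[k1] - OL[k2].
# Keeping the non-negative results reproduces RL = [0, 2, 4, 6, 10] from the example.
pairs = [(k1, k2) for k1 in range(len(OL)) for k2 in range(len(OL))]
RL = np.array(sorted({OL[k1] - OL[k2] for k1, k2 in pairs if OL[k1] - OL[k2] >= 0}))

m, n = len(RL), len(OL)
C1 = np.zeros((m, n))
C2 = np.zeros((m, n))
for k1, k2 in pairs:
    r = OL[k1] - OL[k2]
    if r < 0:
        continue
    j = int(np.where(RL == r)[0][0])
    C1[j, k1] = N1[k1]                       # OL[k1] can be the 1st operand for RL[j]
    C2[j, k2] = N2[k2]                       # OL[k2] can be the 2nd operand for RL[j]

# Marginalize the joint probability over all valid pairs for each result.
p_RL = np.zeros(m)
for k1, k2 in pairs:
    r = OL[k1] - OL[k2]
    if r >= 0:
        j = int(np.where(RL == r)[0][0])
        p_RL[j] += C1[j, k1] * C2[j, k2]

print(RL)        # [ 0  2  4  6 10]
print(C1[2])     # [0.  0.4 0.  0.3]  -- matches C1[2,1] = 0.4 and C1[2,3] = 0.3 above
```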

We compose add/sub modules in programs to perform 3-number arithmetic. The key to our approach is to construct and distinguish the appropriate $\mathbf{C}_i$ and $RL$ at different reasoning steps. In the second arithmetic step, we combine the operand list from the paragraph and the result list from the previous step to obtain a new result list $RL^{\prime}$: $\verb|add/sub|(RL, N) \rightarrow RL^{\prime}$. Because the operands and results change, the modules must refer to a different $\mathbf{C}_i^{\prime} \in \mathbb{R}^{m^{\prime} \times n^{\prime}}$ in this computation. We extend the 2-number add/sub modules to recognize the participation of a third number via a conditional statement, so that the interpreter can differentiate which operand and result lists to refer to at each step. Taking the last example in Figure 1, the addition module first computes the distribution over the result list for ‘Albanian and Bulgarian citizens’. The subtraction module then identifies itself as the second step and takes the correct input to construct the new matrix $\mathbf{C}_i^{\prime}$. The expected distribution over the new result list $RL^{\prime}$ now represents the difference between ‘Greek citizens’ and the previous result.

Instead of introducing specific modules for multi-number arithmetic such as ‘3-num-add’, the structure of NMNs allows us to recursively execute basic operations several times in a compositional program. This design accords with the reasoning process of the CQA task and makes it natural for NMNs to perform complex computations.
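As a rough illustration of this compositional execution, the simplified sketch below abstracts the add/sub modules into one helper and chains two steps for a hypothetical program sub(add($N_1$, $N_2$), $N_3$); the operand values, toy one-hot distributions, and the helper itself are our own assumptions, not the actual module implementation.

```python
import numpy as np

def arith_step(ops_a, dist_a, ops_b, dist_b, sign=+1):
    """One 2-number arithmetic step: combine value lists (ops_a + sign * ops_b)
    and marginalize the pair probabilities into a distribution over results.
    Returns (result_list, result_distribution). Simplified sketch only."""
    results = sorted({a + sign * b for a in ops_a for b in ops_b})
    idx = {r: j for j, r in enumerate(results)}
    p = np.zeros(len(results))
    for ka, a in enumerate(ops_a):
        for kb, b in enumerate(ops_b):
            p[idx[a + sign * b]] += dist_a[ka] * dist_b[kb]
    return np.array(results), p

OL = np.array([50, 80, 120])          # made-up operand values from a paragraph
N1 = np.array([0.0, 0.0, 1.0])        # toy one-hot: 1st operand is 120
N2 = np.array([0.0, 1.0, 0.0])        # toy one-hot: 2nd operand is 80
N3 = np.array([1.0, 0.0, 0.0])        # toy one-hot: 3rd operand is 50

RL, p_RL = arith_step(OL, N1, OL, N2, sign=+1)     # step 1: addition over OL
RL2, p_RL2 = arith_step(RL, p_RL, OL, N3, sign=-1) # step 2: subtraction uses RL as its first operand list
print(RL2[np.argmax(p_RL2)])          # 120 + 80 - 50 = 150
```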

4 Experiments

Dataset.

We construct our own train/dev/test sets based on the DROP dataset Dua et al. (2019), which requires numerical reasoning skills.

Gupta et al. (2020) extracted a subset of questions from DROP that is supported by the model’s reasoning capability. This subset contains approximately 20,000/500/2,000 QA pairs for train/dev/test. To train the add/sub modules, we augment the NMNs’ subset with more than 5,000 new questions from DROP. These questions were heuristically identified based on first n-grams and regular expressions (Appendix A.2). Statistics of this newly constructed dataset can be found in Table 1. Note that the add-sub questions include both 2- and 3-number arithmetic, and all experiments in this paper are conducted on this new dataset. Model performance is evaluated with the same F1 and EM (Exact Match) scores as Gupta et al. (2020).

Question types train dev test
Full 25,165 623 2,547
date-compare (13.9%) 3,505 91 333
date-difference (12.2%) 3,055 75 313
number-compare (12.1%) 2,642 157 632
extract-number (12.8%) 3,349 57 222
count (17.3%) 4,527 73 288
extract-argument (13.1%) 3,467 51 208
add-sub (18.6%) 4,689 124 553
   2-numbers 4,440 106 505
   3-numbers 259 24 66
Table 1: Question type distribution on the expanded DROP subset used in the following experiments.

Result.

In Table 2, we list the overall performance of the original NMNs, NumNet and our proposed method NMNs±.

Method F1 EM
original NMNs Gupta et al. (2020) 57.5 54.9
NumNet Ran et al. (2019) 72.3 69.4
NMNs± (ours) 75.2 72.6
   w/o add/sub 61.4 58.1
   w/o qi 74.3 71.7
Table 2: Performance comparison between different models on our test set. Due to the page limit, the case study and analysis are in Appendix A.5.

In Table 2, the row “w/o add/sub” is the model variant with question incorporation only, and the row “w/o qi” is the variant with the add/sub modules only. Compared to the original NMNs, both proposed methods improve model performance, with the add/sub modules contributing more. Our full NMNs± model, with both components added, achieves 75.2% F1 and 72.6% EM, a significant gain of 17.7% absolute points over the original NMNs in both F1 and EM. Additionally, NMNs± outperforms NumNet by 2.9% and 3.2% absolute points in F1 and EM, respectively.

This comparison could be seen as unfair, since the original NMNs inevitably perform poorly on the newly added add-sub questions. Therefore, we list the model performance on different question types in Table 3. Our model achieves higher scores across almost all question types compared to the original NMNs, attesting to the effectiveness of our proposed techniques. It also turns out that adding add-sub question types and more training data does not improve the results on the original DROP split, which might be due to performance degradation of the programmer after adding the new add-sub programs. Compared with NumNet, although our model falls behind on 2-number add-sub questions, it achieves a 5.4% F1 improvement on 3-number add-sub questions, resulting in comparable overall performance. Note that the 2-number data is nearly 18 times larger than the 3-number data, which suggests that our model relies less on large-scale training data.

Question type NMNs NMNs± NumNet
date-compare 79.2 84.9 72.0
date-difference 69.0 73.3 74.1
number-compare 89.6 90.3 89.9
extract-number 86.4 89.1 85.6
count 54.2 60.2 52.4
extract-argument 73.4 75.3 66.1
add-sub 0.7 66.4 67.6
   2-numbers 0.8 67.9 71.5
   3-numbers 0.3 41.2 35.8
Table 3: F1 comparison on different question types.

Additional ablation studies for the add/sub modules (A.3) and a qualitative analysis (A.4) can be found in the appendix.

5 Conclusion

In this work, we extend NMNs’ numerical reasoning capability to 2- and 3-number addition and subtraction, and incorporate question information into the interpreter for number-related questions. Experimental results show that our methods significantly enhance NMNs’ numerical reasoning ability, with an increase of 17.7% absolute F1 points on a newly constructed DROP subset that includes arithmetic questions. Moreover, our approach also outperforms NumNet, a SOTA numerical reasoning model, by 2.9% F1 points.

Acknowledgements

This work is partially funded by the DARPA CCU program (HR001121S0024).

References

Appendix A Appendix

A.1 Hyper-parameter setting for compare-date modules

As mentioned above, we use a hyper-parameter $\alpha$ to weight the paragraph’s and question’s contributions to the combined context representation. We determine the final coefficient through a series of controlled comparison experiments: we train and validate the model on the same data with different values of $\alpha$. The model achieves the best performance (84.9 F1) on date-compare questions when $\alpha$ is set to 0.4 (40% weight for paragraph attention and 60% for question attention), an increase of 5.7 absolute points over the original NMNs model. This experiment verifies the importance of question information in the numerical reasoning process.

A.2 Data extraction

In this research, we expand the DROP subset for the original NMNs to cover addition and subtraction questions. Subtraction questions can be easily targeted by their first n-grams, such as ‘how many more’ and ‘how many yards difference’. For three-number subtraction, we further specify the format with regular expressions, matching patterns such as ‘how many more event-a and event-b than event-c?’ or ‘how many more event-a compared to event-b and event-c?’. For addition, it is hard to identify how many numbers should participate in the calculation from some of the questions (e.g., ‘how many total yards did Roethlisberger get in the game?’). Therefore, we use regular expressions to distinguish two- and three-number addition, following patterns such as ‘how many total …’ and ‘how many … combined’.
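As an illustration of this kind of heuristic, the sketch below classifies questions with n-gram prefixes and regular expressions; the exact patterns used to build our subset are not listed here, so the patterns and labels in the code are our own approximations.

```python
import re

# Illustrative patterns only; the actual extraction rules may differ.
SUB_PREFIXES = ("how many more", "how many yards difference")
SUB_3NUM = re.compile(r"how many more .+ and .+ than .+\?")
ADD_PATTERNS = (
    re.compile(r"how many total\b"),
    re.compile(r"how many .+ combined\b"),
)

def classify(question: str) -> str:
    """Heuristically label a DROP question as add, 2-number sub, 3-number sub, or other."""
    q = question.lower().strip()
    if q.startswith(SUB_PREFIXES):
        return "sub-3num" if SUB_3NUM.match(q) else "sub-2num"
    if any(p.search(q) for p in ADD_PATTERNS):
        return "add"
    return "other"

print(classify("How many more Albanian and Bulgarian citizens than Greek citizens?"))  # sub-3num
print(classify("How many total yards did Roethlisberger get in the game?"))            # add
```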

A.3 Addition and subtraction modules training

To examine the contribution of the individual addition and subtraction modules to NMNs, we conduct an ablation experiment by training and testing the model on different datasets, as shown in Table 4. The five rows represent the model trained on: addition questions only, subtraction questions only, addition plus the original NMNs subset, subtraction plus the original NMNs subset, and our full subset. The columns report performance when testing on the addition/subtraction questions only and on the full DROP subset. As the results show, the model trained with subtraction ability only performs better than the one trained with addition ability only.

Datasets add-sub dataset full dataset
F1 EM F1 EM
add 41.2 41.2 46.0 43.8
sub 45.7 45.7 51.3 49.2
add+origin 51.5 51.5 69.2 64.2
sub+origin 55.1 55.1 72.6 69.8
add+sub+origin 66.4 66.4 74.3 71.7
Table 4: Ablation experiment results for the addition and subtraction modules.

A.4 Qualitative analysis

Figure 2 shows some incorrect predictions from the original NMNs and the answers from our improved model NMNs±. From the examples, we can clearly see how the proposed techniques improve the numerical reasoning process:

Figure 2: Qualitative analysis. The highlighted spans correspond to the modules in the program for each question.
  • In the first example, the original NMNs match the wrong tokens ‘dissolved the Constituent Assembly’ given the question ‘Which event happened first, the Constituent Assembly being elected, or the elimination of hierarchy in the army?’, and thus locate a wrong date, ‘January 1918’. After enhancing the interpreter’s awareness of the question, NMNs± precisely targets the span ‘a Constituent Assembly was elected’ in the paragraph and provides the correct prediction.

  • The following two examples are wrongly answered by the original NMNs because of incorrect program predictions. The second question was initially categorized as a count question, which called the count module to calculate the number of attended paragraph spans. The same situation occurs in the third question, because the original NMNs lack modules that can correctly express the reasoning behind the question. The prediction results show that our NMNs± model handles simple arithmetic operations such as addition and subtraction, which meets the task requirement.

A.5 Prediction analysis

Our study of the original NMNs’ wrong predictions on DROP is the main motivation for our proposed methods. We detail the error factors for five numerical question types: date-compare, count, date-difference, number-compare, and extract-number.

Figure 3: Root causes of wrong predictions in date-compare questions. The related events mentioned in the question are highlighted in blue and red, and their relevant dates are underlined in the same colors.
Figure 4: Root causes of wrong predictions in count questions. The inputs to the find module and their targets in the paragraph are highlighted in red. The blue spans are related to the filter module.
Figure 5: Root causes of wrong predictions in date-difference questions. The related events, which are the inputs of the find module, are highlighted in blue. The dates correctly grounded by the compare-date modules are highlighted in red. The answer predicted by NMNs should be the difference between these two dates.
Figure 6: Root causes of wrong predictions in number-compare questions. Similar to Figure 1, the inputs of the find module are highlighted in blue and red, and their related numbers are underlined. The paragraph span predicted as the answer is the one associated with the smaller/larger-valued number, depending on what the question asks.
Figure 7: Root causes of wrong predictions in extract-number questions. The inputs to the find module and their targets in the paragraph are highlighted in red. The blue spans are related to the filter module. The find-num module finally extracts the number associated with this paragraph attention as the answer.