Teaching Neural Module Networks to Do Arithmetic
Abstract
Answering complex questions that require multi-step, multi-type reasoning over raw text is challenging, especially when numerical reasoning is involved. Neural Module Networks (NMNs) follow the programmer-interpreter framework and design trainable modules to learn different reasoning skills. However, NMNs have only limited reasoning abilities and lack numerical reasoning capability. We upgrade NMNs by: (a) bridging the gap between the interpreter and the complex questions; (b) introducing addition and subtraction modules that perform numerical reasoning over numbers. On a subset of DROP, experimental results show that our proposed methods enhance NMNs’ numerical reasoning skills by a 17.7% absolute improvement in F1 score and significantly outperform previous state-of-the-art models.

1 Introduction
Complex Question Answering (CQA) over text is a challenging task in Natural Language Understanding (NLU). Based on the programmer-interpreter paradigm, Neural Module Networks (NMNs) Gupta et al. (2020) learn to first parse complex questions as executable programs composed of various predefined trainable modules, and then execute such programs (implemented by modules) over the given paragraph to predict answers of all kinds. NMNs achieve competitive reasoning performance on a subset of DROP Dua et al. (2019), and possess remarkable interpretability that is also important for CQA.
However, NMNs’ numerical reasoning capability is insufficient: they are incapable of handling arithmetic operations such as addition and subtraction between numbers, which make up nearly 40% of the questions in the DROP dataset. Moreover, a gap exists between the interpreter and the complex question, since there is no interaction between them. Motivated by these observations, we propose two methods to improve NMNs’ numerical reasoning skills.
First, we incorporate the original question into the interpreter, aiming to directly provide question information in the “execution” process, especially for number-related questions. The intuition is that, in the original NMNs, questions participate in the process only through the programmer, which can create a distance between queries and returns. For example, in Figure 1, the first row shows that the original NMNs find the wrong event (i.e., ‘besieged Sinj’) based solely on the paragraph information. In contrast, our model NMNs± can easily target the correct event (i.e., ‘Sinj finally fell’) with the help of question information.
Second, we introduce new modules to support addition and subtraction of up to three numbers.
Endowing NMNs with the ability to support arithmetic can greatly boost its overall performance on DROP and beyond.
For instance, in Figure 1, the second row shows that the original NMNs improperly adopt the find-num module for the addition question because the module set does not cover such an arithmetic ability.
To facilitate the learning of the add/sub modules, we extract QA pairs related to addition and subtraction from the original DROP dataset to construct a new dataset for training and evaluation.
Experimental results show that our methods significantly enhance NMNs’ numerical reasoning capability. On a subset of DROP, our methods improve the F1 score by 17.7% absolute points overall, and by 65.7% absolute points on add-sub questions. Compared to NumNet Ran et al. (2019), which is specifically designed for numerical reasoning, our method outperforms it by 2.9% absolute F1 points.
2 Background and Related Work
Semantic parsing is a widely adopted approach to the complex question answering (CQA) task, which involves a number of reasoning steps. In this approach, a programmer maps natural-language questions into machine-readable representations (logical forms), which are executed by an interpreter to yield the final answer. For instance, WNSMN Saha et al. (2021) uses a generalized framework of dependency parsing inspired by the Stanford dependency parse tree Chen and Manning (2014) to parse queries into noisy heuristic programs. Neural Module Networks Gupta et al. (2020) extend semantic parsing by making the interpreter a learnable function with specified modules and executing the logical forms from the programmer in a step-wise manner.
Neural Module Networks were initially proposed for the Visual Question Answering (VQA) problem Andreas et al. (2016), where questions are often compositional.
Gupta et al. (2020) employ the programmer-interpreter framework with attention Vaswani et al. (2017) to tackle the CQA task.
Specifically, the programmer parses each question into an executable program.
The interpreter takes the program as input and performs various symbolic reasoning functions.
The modules are defined in a differentiable way, aiming to maintain the uncertainty about each intermediate decision output and propagate them through layers.
For instance, the predicted program of the first example in Figure 1 is span(compare-date-lt(find, find)). The interpreter first calls the find module twice to find the events queried by the question (e.g., ‘the fall of Sinj’) and outputs the appropriate paragraph attention. The compare-date-lt module can then locate the dates (e.g., ‘30 September 1686’) to compute their relation.
By demonstrating the intermediate reasoning steps in this manner, NMNs perform interpretable problem-solving.
Numerical reasoning is a necessary ability for models handling the CQA task Geva et al. (2020). Dua et al. (2019) modify the output layer of QANet Yu et al. (2018) and propose a number-aware model, NAQANet, to deal with numerical questions. NumNet Ran et al. (2019) leverages a graph neural network to capture relations between numbers. Similarly, QDGAT Chen et al. (2020a) distinguishes number types more precisely by adding connections with entities and obtains better performance. NeRd Chen et al. (2020b) searches possible programs exhaustively based on answers and employs these programs as weak supervision. Another related work Guo et al. (2021) proposes a question-aware interpreter but uses an entirely different approach to measure the alignment between the question and the context paragraph. Although these approaches achieve high performance on the DROP dataset, their reasoning procedures are not interpretable.
3 Model
In this section, we describe our proposed methods. Section 3.1 presents the incorporation of questions into the interpreter, and Section 3.2 describes the newly introduced addition and subtraction modules.
3.1 The Incorporation of Questions
Take the compare-date module as a case study: it performs comparisons between two references queried by the question. A key reasoning step inside is the find-date module, which obtains an appropriate date-token distribution related to each reference. It is worth noting that there is no interaction with the question, which could contain essential information (e.g., entities) that is useful for correctly answering the question. Therefore, we revise the find-date module as follows:
$C = [\,\lambda P \,;\, (1-\lambda)\,Q\,]$  (1)

$S = C\,W\,D^{\top}$  (2)

$d = \sum_{i} \big[\,\lambda\,p \,;\, (1-\lambda)\,q\,\big]_{i} \cdot \mathrm{softmax}(S_{i,:})$  (3)
where $P$ and $Q$ represent the contextualized embeddings of the paragraph and question, $D$ the embeddings of the date tokens in the paragraph, $W$ is a trainable parameter, and $p$ and $q$ are the expected attention distributions over the paragraph and the question respectively.
In Equation 1, we concatenate the paragraph embeddings $P$ and question embeddings $Q$ output from a pre-trained BERT Devlin et al. (2019) model to construct the context representation $C$. A hyper-parameter $\lambda$ is used to adjust their contributions, whose value is empirically determined (Appendix A.1). The context representation is then used to compute the improved similarity matrix $S$ (Eq. 2). We concatenate the paragraph and question attention inputs in the same way to calculate the final expected distribution $d$ over the date tokens (Eq. 3). Now the interpreter is equipped with question information to make its prediction.
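To make the computation concrete, the following is a minimal PyTorch sketch of the revised find-date step under Eqs. 1-3. It is not the authors' implementation; the tensor shapes and the function name question_aware_find_date are our own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def question_aware_find_date(P, Q, p_att, q_att, D, W, lam=0.4):
    """Sketch of the question-aware find-date step (Eqs. 1-3).

    P: (n_p, h) paragraph token embeddings   p_att: (n_p,) paragraph attention
    Q: (n_q, h) question token embeddings    q_att: (n_q,) question attention
    D: (n_d, h) date-token embeddings        W: (h, h) trainable parameter
    lam: weight on the paragraph side (1 - lam goes to the question side)
    """
    # Eq. 1: concatenate the weighted paragraph and question representations
    C = torch.cat([lam * P, (1.0 - lam) * Q], dim=0)           # (n_p + n_q, h)
    # Eq. 2: similarity between every context token and every date token
    S = C @ W @ D.T                                             # (n_p + n_q, n_d)
    # Eq. 3: expected date-token distribution, marginalised over context tokens
    a = torch.cat([lam * p_att, (1.0 - lam) * q_att], dim=0)    # (n_p + n_q,)
    d = a @ F.softmax(S, dim=-1)                                # (n_d,)
    return d
```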
3.2 Addition and Subtraction Modules
In the NMNs’ modelling paradigm, an addition/subtraction module takes as input two number distributions and produces an output number distribution over all possible result values: $\mathrm{add/sub}(N_1, N_2) \rightarrow p_R$. $N_1$ and $N_2$ represent the probability distributions of the first and second operands over all numbers that are extracted from the paragraph and collected into a sorted operand list $V$. The positive and negative values of these numbers are exhaustively combined in pairs, and the possible results of addition/subtraction operations are compiled into a sorted result list $R$. For each input number distribution $N_j$, a matrix $A_j \in \mathbb{R}^{m \times k}$ is constructed, where $m$ is the total number of possible results and $k$ is the maximum number of unique operand combinations. Each value $A_j[r][c]$ is found by looking up in $N_j$ the probability of the operand that appears in the $c$-th pair producing result $r$, i.e., the probability that this number in $V$ is the correct $j$-th operand of the pair. We compute the marginalized joint probability by summing the products of corresponding entries of $A_1$ and $A_2$ to obtain the expected distribution $p_R$ over the result list $R$. For the addition module, it is:

$p_R(r) = \sum_{c=1}^{k} A_1[r][c] \cdot A_2[r][c]$
For instance, assume the sorted operand list $V$ extracted from a paragraph is [1, 5, 7, 11] and $N_1(5) = 0.4$. Different sign combinations are formed, e.g., (+, +) for addition and (+, -) for subtraction, and all possible results of the combinations are compiled into two result lists, one for addition and one for subtraction. For subtraction in this case, the result list contains, among others, the value 4. The value of $A_1[4][1]$ is 0.4, found by looking up $N_1(5)$, because the result 4 can be calculated from (+5, -1); and $A_1[4][2]$ equals $N_1(11)$, as 4 is the result of (+11, -7) as well. $A_2$ is computed in the same way to further obtain the final distribution over $R$.
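The marginalization can be illustrated with a short Python sketch. It is not the authors' implementation: rather than materializing the lookup matrices $A_1$ and $A_2$, it sums the joint probability of every operand pair directly, which yields the same expected distribution over the sorted result list; the operand distributions in the usage lines are made up for illustration.

```python
from collections import defaultdict

def add_sub_module(values1, dist1, values2, dist2, op="add"):
    """Sketch of a 2-number add/sub step: marginalise the joint probability of
    two operand distributions onto the sorted result list R."""
    result_prob = defaultdict(float)
    for v1, p1 in zip(values1, dist1):
        for v2, p2 in zip(values2, dist2):
            r = v1 + v2 if op == "add" else v1 - v2
            result_prob[r] += p1 * p2          # sum over all pairs that yield r
    results = sorted(result_prob)              # sorted result list R
    return results, [result_prob[r] for r in results]

# Hypothetical usage on the example operand list V = [1, 5, 7, 11]:
V = [1, 5, 7, 11]
n1 = [0.1, 0.4, 0.3, 0.2]                      # assumed distribution; N1(5) = 0.4
n2 = [0.25, 0.25, 0.25, 0.25]                  # assumed uniform second operand
R_sub, p_sub = add_sub_module(V, n1, V, n2, op="sub")
```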
We compose add/sub modules in programs to perform 3-number arithmetic. The key to our approach is to construct and distinguish the appropriate operand and result lists ($V$ and $R$) in different reasoning steps.
In the second arithmetic step, we combine the operand list $V$ from the paragraph and the result list $R$ from the previous step to obtain a new result list $R'$. Due to the changes in operands and results, the modules should refer to a different matrix $A'$ in the computation.
We extend the 2-number add/sub modules to recognize the participation of a third number via a conditional statement, in order to differentiate which operand and result lists the interpreter should refer to in each step.
Taking the last example in Figure 1, the addition module would first compute the distribution over the result list $R$ for ‘Albanian and Bulgarian citizens’. The subtraction module can then identify that it is in the second step of the calculation and take the correct input to construct the new matrix $A'$. The expected distribution over the new result list $R'$ now represents the difference between ‘Greek citizens’ and the previous result.
Instead of introducing specific modules for multi-number arithmetic such as ‘3-num-add’, the structure of NMNs allows us to recursively execute basic operations several times in a compositional program. This design is in accord with the reasoning process of the CQA task, and natural for NMNs to perform complex computations.
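Reusing the hypothetical add_sub_module from the sketch above, a 3-number question is handled by two successive calls: the result list of the first step simply takes the place of one operand list in the second step, mirroring the compositional execution described here.

```python
# Step 1: add two operands (e.g., 'Albanian' + 'Bulgarian' citizens).
R1, p1 = add_sub_module(V, n1, V, n2, op="add")

# Step 2: subtract the step-1 result from the third operand (e.g., 'Greek' citizens).
# The previous result list R1 replaces one operand list, so the module now
# marginalises over V x R1 to produce the new result list R'.
n3 = [0.25, 0.25, 0.25, 0.25]                  # assumed third-operand distribution
R2, p2 = add_sub_module(V, n3, R1, p1, op="sub")
```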
4 Experiments
Dataset.
We construct our own train/dev/test sets based on the DROP dataset Dua et al. (2019), which requires numerical reasoning skills.
Gupta et al. (2020) extracted a subset of questions from DROP that is supported by the model’s reasoning capability.
This subset contains approximately 20,000/500/2,000 QA pairs for train/dev/test.
To train the add/sub modules, we augment the NMNs’ subset with more than 5,000 new questions from DROP.
These questions were heuristically identified based on first n-grams and regular expressions (Appendix A.2).
Statistics of this newly constructed dataset can be found in Table 1.
Note that the add-sub questions include both 2- and 3-number arithmetic, and all experiments in this paper are conducted on this new dataset.
Model performance is evaluated with the same F1 and EM (Exact Match) scores as Gupta et al. (2020).
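For reference, a simplified version of these metrics is sketched below; the official DROP evaluation additionally normalizes numbers, articles and punctuation and aligns multi-span answers, so this is only an approximation.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(prediction.lower().strip() == gold.lower().strip())

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```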
Table 1: Statistics of the newly constructed dataset by question type.

| Question type | train | dev | test |
|---|---|---|---|
| Full | 25,165 | 623 | 2,547 |
| date-compare (13.9%) | 3,505 | 91 | 333 |
| date-difference (12.2%) | 3,055 | 75 | 313 |
| number-compare (12.1%) | 2,642 | 157 | 632 |
| extract-number (12.8%) | 3,349 | 57 | 222 |
| count (17.3%) | 4,527 | 73 | 288 |
| extract-argument (13.1%) | 3,467 | 51 | 208 |
| add-sub (18.6%) | 4,689 | 124 | 553 |
| add-sub: 2-number | 4,440 | 106 | 505 |
| add-sub: 3-number | 259 | 24 | 66 |
Results.
In Table 2, we list the overall performance of the original NMNs, NumNet and our proposed method NMNs±.
Table 2: Overall performance of the original NMNs, NumNet and NMNs± on the test set.

| Method | F1 | EM |
|---|---|---|
| original NMNs Gupta et al. (2020) | 57.5 | 54.9 |
| NumNet Ran et al. (2019) | 72.3 | 69.4 |
| NMNs± (ours) | 75.2 | 72.6 |
| w/o add/sub | 61.4 | 58.1 |
| w/o qi | 74.3 | 71.7 |
In Table 2, row “w/o add/sub” is the model variant with the question incorporation (qi) component only, and row “w/o qi” has the add/sub modules only. Compared to the original NMNs, both proposed methods improve model performance, and the add/sub modules contribute more.
Our full NMNs± model, with both components added, achieves 75.2% F1 and 72.6% EM scores, a significant gain of 17.7 absolute points over the original NMNs in both F1 and EM. Additionally, NMNs± outperforms NumNet by 2.9 and 3.2 absolute points in F1 and EM respectively.
This comparison can be unfair, since the original NMNs inevitably perform poorly on the newly added add-sub questions. Therefore, we list the model performance on different question types in Table 3. Our model achieves higher scores across almost all question types compared to the original NMNs, attesting to the effectiveness of our proposed techniques. It also turns out that adding the add-sub question types and more training data does not improve the results on the original DROP split, which might be due to performance degradation of the programmer after adding the new add-sub programs. Compared with NumNet, although our model falls behind on 2-number add-sub questions, it achieves a 5.4% F1 improvement on 3-number add-sub questions, resulting in comparable overall performance. Note that the 2-number data is nearly 18 times the size of the 3-number data, which suggests that our model relies less on large-scale training data.
Table 3: F1 scores on different question types.

| Question type | NMNs | NMNs± | NumNet |
|---|---|---|---|
| date-compare | 79.2 | 84.9 | 72.0 |
| date-difference | 69.0 | 73.3 | 74.1 |
| number-compare | 89.6 | 90.3 | 89.9 |
| extract-number | 86.4 | 89.1 | 85.6 |
| count | 54.2 | 60.2 | 52.4 |
| extract-argument | 73.4 | 75.3 | 66.1 |
| add-sub | 0.7 | 66.4 | 67.6 |
| add-sub: 2-number | 0.8 | 67.9 | 71.5 |
| add-sub: 3-number | 0.3 | 41.2 | 35.8 |
5 Conclusion
In this work, we extend NMNs’ numerical reasoning capability to 2- and 3-number addition and subtraction, and incorporate question information into the interpreter for number-related questions. Experimental results show that our methods significantly enhance NMNs’ numerical reasoning ability, with an increase of 17.7% absolute F1 points on a newly constructed DROP subset that includes arithmetic questions. Moreover, our approach also outperforms NumNet, a SOTA numerical reasoning model, by 2.9% F1 points.
Acknowledgements
This work is partially funded by the DARPA CCU program (HR001121S0024).
References
- Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 39–48. IEEE Computer Society.
- Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750.
- Chen et al. (2020a) Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020a. Question directed graph attention network for numerical reasoning over text. In Proceedings of EMNLP, pages 6759–6768.
- Chen et al. (2020b) Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2020b. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. In Proceedings of ICLR.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT, pages 2368–2378.
- Geva et al. (2020) Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 946–958. Association for Computational Linguistics.
- Guo et al. (2021) Xiaoyu Guo, Yuan-Fang Li, and Gholamreza Haffari. 2021. Improving numerical reasoning skills in the modular approach for complex question answering on text. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pages 2713–2718. Association for Computational Linguistics.
- Gupta et al. (2020) Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. 2020. Neural module networks for reasoning over text. In Proceedings of ICLR.
- Ran et al. (2019) Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of EMNLP-IJCNLP 2019, pages 2474–2484.
- Saha et al. (2021) Amrita Saha, Shafiq R. Joty, and Steven C. H. Hoi. 2021. Weakly supervised neuro-symbolic module networks for numerical reasoning. CoRR, abs/2101.11802.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In Proceedings of ICLR.
Appendix A Appendix
A.1 Hyper-parameter setting for compare-date modules
As mentioned above, we use a hyper-parameter $\lambda$ to weight the question’s and paragraph’s contributions to the combined context representation. We determine the final coefficient through a series of controlled comparison experiments: we use the same data to train and validate the model with different values of $\lambda$. The model achieves the best performance (84.9 F1) on date-compare questions when $\lambda$ is set to 0.4 (40% for paragraph attention and 60% for question attention), an increase of 5.7 absolute points over the original NMNs model. The experiment verifies the importance of question information in the numerical reasoning process.
A.2 Data extraction
In this research, we expand the DROP subset used by the original NMNs to cover addition and subtraction questions. Subtraction questions can be easily targeted by their first n-grams, such as ‘how many more’ or ‘how many yards difference’. For three-number subtraction, we further specify the format with regular expressions, such as ‘how many more event-a and event-b than event-c?’ or ‘how many more event-a compared to event-b and event-c?’. For addition, it is hard to identify from some questions how many numbers should participate in the calculation (e.g., ‘how many total yards did Roethlisberger get in the game?’). Therefore, we use regular expressions to distinguish two- and three-number addition and follow patterns such as ‘how many total …’ and ‘how many … combined’.
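The sketch below illustrates this kind of heuristic matching; the patterns are simplified stand-ins for the actual n-gram lists and regular expressions used to build the dataset.

```python
import re

SUB_PREFIXES = ("how many more", "how many yards difference")
ADD_PATTERNS = [re.compile(r"^how many total\b"),
                re.compile(r"^how many .+ combined\b")]
# Three-number subtraction, e.g. 'how many more A and B than C?'
THREE_NUM_SUB = re.compile(r"^how many more .+ (and .+ than|compared to .+ and) .+\?$")

def classify(question: str) -> str:
    """Return a rough label: 'sub-2', 'sub-3', 'add', or 'other'."""
    q = question.lower().strip()
    if q.startswith(SUB_PREFIXES):                 # subtraction via first n-grams
        return "sub-3" if THREE_NUM_SUB.match(q) else "sub-2"
    if any(p.search(q) for p in ADD_PATTERNS):     # addition via regex patterns
        return "add"
    return "other"
```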
A.3 Addition and subtraction modules training
To assess the contribution of the individual addition and subtraction modules to NMNs, we conduct an ablation experiment by training and testing the model on different datasets, as shown in Table 4. The five rows represent the model trained on various datasets: addition questions only, subtraction questions only, addition questions plus the original NMNs subset, subtraction questions plus the original NMNs subset, and our full subset. The columns indicate model performance when tested on addition/subtraction questions only and on the full DROP subset. As the results show, the model with subtraction ability only performs better than the one with addition ability only.
Table 4: Performance of models trained on different datasets, evaluated on add-sub questions only and on the full subset.

| Datasets | add-sub F1 | add-sub EM | full F1 | full EM |
|---|---|---|---|---|
| add | 41.2 | 41.2 | 46.0 | 43.8 |
| sub | 45.7 | 45.7 | 51.3 | 49.2 |
| add+origin | 51.5 | 51.5 | 69.2 | 64.2 |
| sub+origin | 55.1 | 55.1 | 72.6 | 69.8 |
| add+sub+origin | 66.4 | 66.4 | 74.3 | 71.7 |
A.4 Qualitative analysis
Figure 2 shows some incorrect prediction cases from the original NMNs and the corresponding answers from our improved model NMNs±. From these examples, we can clearly see how the proposed techniques improve the numerical reasoning process:

- In the first example, the original NMNs match the wrong tokens ‘dissolved the Constituent Assembly’ given the question ‘Which event happened first, the Constituent Assembly being elected, or the elimination of hierarchy in the army?’, and thus locate a wrong date, ‘January 1918’. After enhancing the interpreter’s awareness of the question, NMNs± can precisely target the span ‘a Constituent Assembly was elected’ in the paragraph and provide the correct prediction.
- The following two examples are wrongly answered by the original NMNs because of incorrect program predictions. The second question was initially categorized as a count question, which called the count module to calculate the number of attended paragraph spans. The same situation occurs in the third question, because the original NMNs lack modules that can correctly express the reasoning behind the question. The prediction results demonstrate that our NMNs± model handles simple arithmetic operations such as addition and subtraction, which meets the task requirement.
A.5 Prediction analysis
The case study of the original NMNs’ wrong predictions on DROP was the main motivation for our proposed methods. We summarize the error factors for five numerical question types in detail: date-compare, count, date-difference, number-compare and extract-number.