MetaRuleGPT: Recursive Numerical Reasoning of Language Models Trained with Simple Rules
* Equal Contribution. † Corresponding Author.

Kejie Chen^*, Lin Wang^*, Qinghai Zhang^†, Renjun Xu^†
Zhejiang University, Hangzhou, China

Abstract

Recent studies have highlighted the limitations of large language models in mathematical reasoning, particularly their inability to capture the underlying logic. Inspired by meta-learning, we propose that models should acquire not only task-specific knowledge but also transferable problem-solving skills. We introduce MetaRuleGPT, a novel Transformer-based architecture that performs precise numerical calculations and complex logical operations by learning and combining different rules. In contrast with traditional training sets, which consist largely of massive amounts of raw instance data, MetaRuleGPT is pre-trained on much smaller, abstract datasets containing basic, compound, and iterative rules for mathematical reasoning. Extensive experimental results show that MetaRuleGPT can mimic humans’ rule-following capabilities, break down complexity, and iteratively derive accurate results for complex mathematical problems. These findings demonstrate the potential of rule learning to enhance the numerical reasoning abilities of language models.

Index Terms:

LLMs, rules, high-digit calculations

I Introduction

In the field of Natural Language Processing, large language models such as GPT-4 [1] have made remarkable progress, demonstrating impressive understanding and generation capabilities across a wide range of tasks such as text summarization [2][3][4], question answering [5][6], and mathematical reasoning [7][8][9]. However, these models still face considerable challenges across a broad spectrum of mathematical tasks, including but not limited to basic arithmetic operations, calculus, and equation solving. Despite their powerful language understanding capabilities, they often struggle with even simple numerical calculations. For example, given a straightforward addition problem:

241257284+758742716=1000000000,

most current mainstream language models have difficulty in correctly completing such basic mathematical operations.

Inspired by the success of chain-of-thought (CoT) [10] reasoning, we observe that, similar to the human brain’s System II, a model can achieve more reliable results on complex input problems through repeated and recursive thinking. In mathematics, this approach of recursive reasoning with a simple set of rules serves as an effective paradigm for precisely solving complex problems.

In this context, we introduce MetaRuleGPT, a Transformer-based [11] architecture that outperforms CoT in precise numerical calculations and intricate logical operations by learning and integrating diverse rules, as shown in Fig. 1. MetaRuleGPT emulates human-like rule-following behavior, which enables it to break down complex problems and iteratively arrive at accurate solutions to challenging mathematical problems. This advancement underscores the promise of rule learning in enhancing the numerical reasoning capabilities of language models.

Figure 1: Differences between MetaRuleGPT and the traditional Chain-of-Thought (CoT) reasoning method in handling mathematical problems. MetaRuleGPT employs a rule-based reasoning approach, breaking down and solving problems step by step through predefined rules, such as alignment rules, carry rules, and borrow rules. This structured method ensures accuracy and generalization ability in reasoning, allowing the model to systematically handle various computational tasks while avoiding the common hallucination errors found in CoT methods.

II Related Work

A body of research indicates that LLMs often fail to capture the underlying logic required for mathematical problem-solving, particularly in tasks [6][12] that necessitate precise numerical calculations and complex logical operations. Brown et al. [13] examined the limitations of LLMs in mathematical tasks, revealing that even basic operations can lead to inaccuracies. Several methods [14][15][16] have been proposed to improve the quality of reasoning paths. Complexity-based CoT [15] selects examples with more steps as in-context demonstrations and shows that prompting with more reasoning steps leads to better performance. Wei et al. [17] explored the role of fine-tuning and prompt engineering in improving LLM performance on specific tasks, yet these methods often fall short when applied to higher-order mathematical reasoning.

Meta-learning[18] has emerged as a promising approach to enhance model capabilities, emphasizing the need for transferable problem-solving skills rather than mere task-specific knowledge. Research by Finn et al. [19] and subsequent meta-learning frameworks illustrate how models can be designed to generalize across tasks by learning from previous experiences. This paradigm has inspired our development of MetaRuleGPT, which not only seeks to improve numerical reasoning but also aims to equip models with the ability to learn and apply a variety of mathematical rules through structured pre-training.

Rule learning [20] has demonstrated its efficacy in enhancing reasoning abilities, showing that integrating rule-based reasoning into model architectures can yield significant performance gains, especially in complex logical reasoning tasks. Building on these insights, MetaRuleGPT employs a Transformer-based architecture that systematically incorporates basic, compound, and iterative rules for mathematical reasoning, thereby addressing the shortcomings observed in existing models.

III Research Methodology

To enhance the accuracy and generalization of language models [21] in solving complex logical reasoning and numerical calculation tasks, we introduce the MetaRuleGPT model. This model aims to bolster the reasoning capabilities and generalization potential of language models [22], inspired by the concept of meta-learning. MetaRuleGPT focuses on mastering general learning strategies to precisely complete complex logical deduction tasks by applying learned rules. The model dynamically integrates basic mathematical computation rules with higher-order operation rules, enhancing its ability to process rule combinations. This design allows MetaRuleGPT to exhibit superior accuracy and generalization when faced with complex logical reasoning challenges, such as mathematical problems.

Furthermore, by adopting this strategy, MetaRuleGPT is not limited to handling a single task but is capable of learning and executing various different tasks. When dealing with multitasking, the rules across different tasks might intersect; our model can flexibly learn these intersecting rules and dynamically apply them to complete multiple tasks simultaneously, while keeping the tasks independent of each other without interference.

III-A Specific Rule Learning for Arithmetic Tasks

TABLE I: Summary Table of Learning Rules for Various Tasks

Rule Type	Numerical Addition	Numerical Subtraction	Vector Cross Product
Vector Table	-	-	$\surd$
Nine Addition Table	$\surd$	-	$\surd$
Nine Subtraction Table	-	$\surd$	$\surd$
Nine Multiplication Table	-	-	$\surd$
Mapping Rule	$\surd$	$\surd$	$\surd$
Carrying Rule	$\surd$	-	$\surd$
Borrowing Rule	-	$\surd$	$\surd$
Vector Product Rule	-	-	$\surd$
Compute Rule	$\surd$	$\surd$	$\surd$

In our research, the model exhibited exceptional logical reasoning and generalization capabilities while tackling three complex tasks: high-digit (10 or more digits) addition, high-digit subtraction, and vector cross-product computation. To facilitate these tasks, we created specialized rule datasets for training. For instance, during training for addition, the model was guided to acquire essential knowledge, including single-digit addition rules, carry rules, digit mapping rules (between numbers and strings), and fundamental computation rules (Fig. 2). By mastering these foundational concepts through meticulous pre-training [21], our model was able to flexibly apply and integrate these rules, enabling it to accurately perform complex mathematical operations, including high-digit addition and subtraction.

Once the model had mastered the rules of addition and subtraction, we extended its capabilities to vector cross-product computations. To achieve this, we introduced rules for vector representation and cross-product computation into the model. Once the model learns these new rules, it can combine them with the existing numerical computation rules to perform vector cross-product calculations. Computing a vector cross product requires the model to handle a large number of derivation steps. Through gradual derivation, combining the right-hand rule for cross products with basic numerical computation rules, the model gains the ability to compute vector cross products. This strategy showcases the model’s deep logical reasoning and strong generalization ability: it solves more complex mathematical operations by learning and integrating the various rules listed in Table I.
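To make the decomposition concrete, the following minimal sketch shows how a three-dimensional cross product reduces to single multiplications and subtractions, using the first vector pair from Table II as input. The function name `cross_product` is a hypothetical helper introduced only for illustration. MetaRuleGPT derives these steps through learned rules rather than by executing code like this.

```python
def cross_product(a, b):
    """Cross product of two 3-D integer vectors via the component formula."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    # Each component is a difference of two single products.
    return (
        a2 * b3 - a3 * b2,
        a3 * b1 - a1 * b3,
        a1 * b2 - a2 * b1,
    )


result = cross_product((6, 5, 7), (9, 3, 1))
print(result)  # (-16, 57, -27)
```

Each component needs two multiplications and one subtraction, which is consistent with Table I listing the multiplication, subtraction, and borrowing rules as prerequisites for this task.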

III-B Arithmetic Dataset

We designed the overall computation rule dataset for training, covering a wide range of arithmetic operations from the most basic single-digit arithmetic tasks to various complex arithmetic rules. This dataset, carefully planned, only contains the most basic single-digit operations, including alignment rules, carry rules, borrow rules, basic computation rules, and composite rules. Each arithmetic expression involves $2$ to $10$ operational steps, involving a series of mathematical computation operations, such as addition $(+)$ , subtraction $(-)$ , and vector cross-product operations $(\times)$ . Our constructed dataset contains approximately $20,000$ records.
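As a rough illustration of what a single-digit rule table could contain, the sketch below enumerates every single-digit addition together with the carry it generates. The record layout (a dictionary keyed by digit pairs) and the inclusion of the digit 0 are assumptions made for this example. The released dataset may encode these rules differently.

```python
from itertools import product


def single_digit_addition_table():
    """Map every digit pair (a, b) to the (carry, digit) of a + b."""
    table = {}
    for a, b in product(range(10), repeat=2):
        carry, digit = divmod(a + b, 10)
        table[(a, b)] = (carry, digit)
    return table


table = single_digit_addition_table()
print(len(table))     # 100
print(table[(7, 6)])  # (1, 3)
```

A table of this size covers every column-level addition, which is why a rule dataset can stay small while still supporting operands of arbitrary length.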

III-C MetaRuleGPT Model Structure

To closely mimic the natural process of humans solving mathematical problems, we did not directly solve each complex arithmetic expression but adopted an iterative and stepwise strategy. Through this method, our model breaks down complex expressions into a series of simpler and basic computational steps, reasoning out the final answer step by step. This approach enables the language model to have a deeper understanding and more effective application of specific rules during the learning process, allowing for flexible combination and application of these rules in problem-solving. MetaRuleGPT’s proficiency in mathematical tasks stems primarily from its mastery of core computation rules rather than mere memorization of specific cases.

Focusing on arithmetic tasks, we developed a language model based on the Transformer architecture for solving mathematical problems, which we refer to as the MetaRuleGPT language model. The model architecture, shown in Fig. 3, includes several key components: the MetaRuleGPT pre-trained model, the RefeedFormatter (formatting tool), and VeriGate (verification gate). We designed a self-iteration method that enables the model to simplify complex problems through continuous iteration and finally obtain the correct answer within a limited number of steps.
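The self-iteration loop can be sketched as follows. This is a minimal illustration under stated assumptions: `step_fn` stands in for one pass of the pre-trained model followed by re-feeding its output, and `is_final` stands in for the check that decides when to stop. Both names, the `max_steps` budget, and the toy reduction are hypothetical, since the text does not specify the internals of RefeedFormatter or VeriGate.

```python
def solve_iteratively(expression, step_fn, is_final, max_steps=20):
    """Apply step_fn repeatedly until is_final accepts the current state."""
    state = expression
    trace = [state]
    steps = 0
    while not is_final(state):
        if steps >= max_steps:
            raise RuntimeError("no final answer within max_steps")
        state = step_fn(state)  # one simplification pass
        trace.append(state)     # keep every intermediate expression
        steps += 1
    return state, trace


def toy_step(state):
    """Toy stand-in for the model: fold the two leftmost operands."""
    parts = state.split("+")
    merged = str(int(parts[0]) + int(parts[1]))
    return "+".join([merged] + parts[2:])


answer, trace = solve_iteratively("3+4+5", toy_step, str.isdigit)
print(answer)  # 12
print(trace)   # ['3+4+5', '7+5', '12']
```

Bounding the loop with a step budget mirrors the requirement that the answer be reached within a limited number of steps.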

III-D MetaRuleGPT Pre-trained Model

We have trained a language model based on the Transformer[11] architecture. To flexibly adjust the model’s parameter size and internal structure, we designed and implemented a custom Transformer model. Given the limited vocabulary involved in the problems we address, we adopted a single-byte-based training method, which offers clear advantages over traditional word-based or character-based methods.

Byte-based language models provide a flexible and effective means for handling multilingual text and unknown characters. Fig. 4 is an example of using the Transformer model to train vector cross product calculation rules. By processing each character individually, the model can ensure more accurate learning of the rules, laying a solid foundation for solving complex logical tasks.
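For readers unfamiliar with byte-level inputs, the sketch below shows one common way to turn an expression into a sequence of byte IDs. It assumes UTF-8 encoding and uses the hypothetical helper names `encode_bytes` and `decode_bytes`. The exact vocabulary and special symbols used by MetaRuleGPT are not specified here.

```python
def encode_bytes(text):
    """Turn a string into a list of byte IDs in the range 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Invert encode_bytes."""
    return bytes(ids).decode("utf-8")


ids = encode_bytes("78+263")
print(ids)                # [55, 56, 43, 50, 54, 51]
print(decode_bytes(ids))  # 78+263
```

Because every digit and operator maps to exactly one ID, the vocabulary never exceeds 256 entries and no digit string is merged into a multi-character token.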

III-E MetaRuleGPT Calculation Example

To demonstrate how the model operates, we use a simple addition example. When the input “Input: 78 + 263” is provided, it is processed sequentially through the Mapping Rule, Compute Rule, Align Rule, and Carry/Borrow Rule to derive the computation result. Fig. 5 illustrates the internal structure of our model and explains how the initial input is transformed into the final result.

1. First, the model structurally processes the input question, where “78 + 263” under the Mapping Rule becomes:

$a_{1}=7,b_{1}=8,c_{1}=2,d_{1}=6,e_{1}=3.$

The expression “ $a_{1}b_{1}+c_{1}d_{1}e_{1}$ ” through the Align Rule becomes:

$c_{1}|(a_{1}+d_{1})|(b_{1}+e_{1}).$

Combining alignment with the mapping rules produces the intermediate output: “ $2|(7+6)|(8+3)$ ”.

2. Similarly, for “ $2|(7+6)|(8+3)$ ”, a combination of the Mapping Rule and the single-digit addition rule (Add Sub-rule) produces the intermediate output: “ $2|13|11$ ”.

3. When “ $2|13|11$ ” is input, the model invokes the Carry Rule and the Mapping Rule to perform digit carry operations, producing the intermediate output: “ $(2+1)|(3+1)|1$ ”.

4. With “ $(2+1)|(3+1)|1$ ” as the new input, the model again applies the Mapping Rule and the Compute Rule, which leads to the final computation result: “ $3|4|1$ ”.

5. With “ $3|4|1$ ” as the final input, the model invokes the formatting rules and uses special symbols for marking. Finally, the result is formatted using VeriGate to output: “Output: 341”.

In summary, MetaRuleGPT utilizes a combination of rules to align, carry, and output the final result during the computation process.
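The worked example above can be reproduced with a short program. The sketch below is a hand-written illustration of the align, compute, and carry steps, not the learned model. The function names and the list-of-columns representation are assumptions chosen so that the printed intermediates match the strings in the example.

```python
def align(a, b):
    """Pair digits column by column, most significant column first."""
    width = max(len(a), len(b))
    cols = []
    for x, y in zip(a.rjust(width), b.rjust(width)):
        cols.append([int(d) for d in (x, y) if d != " "])
    return cols


def show(cols):
    """Render columns in the '2|(7+6)|(8+3)' notation."""
    return "|".join(
        str(c[0]) if len(c) == 1 else "(" + "+".join(map(str, c)) + ")"
        for c in cols
    )


def compute(cols):
    """Evaluate the pending sum inside every column."""
    return [[sum(c)] for c in cols]


def carry(cols):
    """One carry pass: keep the ones digit, hand the tens digit to the left."""
    values = [c[0] for c in cols]
    if values[0] >= 10:
        values = [0] + values  # open a new leading column for the overflow
    out = []
    for i, v in enumerate(values):
        terms = [v % 10]
        if i + 1 < len(values) and values[i + 1] >= 10:
            terms.append(values[i + 1] // 10)
        out.append(terms)
    return out


def add_with_rules(a, b):
    """Add two digit strings, recording every intermediate expression."""
    cols = align(a, b)
    trace = [show(cols)]
    cols = compute(cols)
    trace.append(show(cols))
    while any(c[0] >= 10 for c in cols):
        cols = carry(cols)
        trace.append(show(cols))
        cols = compute(cols)
        trace.append(show(cols))
    return "".join(str(c[0]) for c in cols), trace


result, trace = add_with_rules("78", "263")
print(trace)   # ['2|(7+6)|(8+3)', '2|13|11', '(2+1)|(3+1)|1', '3|4|1']
print(result)  # 341
```

Inputs that produce a chain of carries, such as 999 + 1, simply go around the carry and compute loop more times, which is the iterative behavior the example describes.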

IV Experiments

TABLE II: Partial Test Data Display Table

Data Type	Test Dataset Examples
Randomized Procedure	$6729132856+1854307391,\ldots,1554887316-817095695$
Perfect Decadic Addition	$6659891948+340108052,\ldots,4376628072+623371928$
Reverse Magnitude Subtraction	$62103-2386797965,\ldots,53006-7764286617$
Interleaved Subtraction	$1824453209-482835016,\ldots,8858241744-261714262$
Vector Cross Product	$(6,5,7)\times(9,3,1),\ldots,(8,2,0)\times(6,4,9)$

IV-A Experimental Setup

To demonstrate the accuracy and generalization capability of our MetaRuleGPT model in reasoning tasks, we designed two experiments: numerical arithmetic tasks and vector cross-product computation tasks. These experiments tested not only the model’s basic computational ability but also its ability to solve complex problems, providing a solid foundation for comprehensively evaluating the model’s performance in logical reasoning. To further demonstrate the advantages of MetaRuleGPT, we compared it with several well-known large language models, including ChatGPT-3.5, ChatGPT-4.0 [23], Alibaba’s Qwen [24], Google’s PaLM [25, 26], Llama2 [27], and the arithmetic-focused model Goat [28].

IV-B Test Dataset

Current large language models exhibit certain limitations in handling mathematically rigorous problems, partly due to a lack of deep understanding of mathematical logic. In contrast, our model, built from the ground up on the fundamental principles of mathematics, demonstrates higher precision in solving math challenges. To validate this advantage, we designed diverse test datasets and prepared a comprehensive computation dataset to highlight the significant advantage of our model in mathematical reasoning. Through these validations, we demonstrate that MetaRuleGPT not only precisely grasps and applies the basic logic of mathematics but also exhibits powerful generalization in problem-solving.

In the domain of arithmetic tasks, we constructed a diverse training dataset containing a wide range of arithmetic operations. To comprehensively evaluate our model’s computational accuracy and generalization ability, we designed an evaluation dataset containing $8,000$ test cases, with examples shown in Table II, that is entirely non-overlapping with the training set. This dataset covers various types of numerical operations, including but not limited to perfect decadic addition, reverse magnitude subtraction, interleaved subtraction, and addition and subtraction operations based on randomly generated numbers.

To compare with other models focusing on mathematical calculations, we additionally used a large randomly generated dataset sampled from a logarithmic space (our dataset is available at https://www.scidb.cn/en/detail?dataSetId=04575028fb8d4bfabeeba5825c8f57fc). This sampling approach ensures that the numbers are equally likely to originate from different orders of magnitude, similar to the method employed by [29].
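One simple way to approximate sampling from a logarithmic space is to draw the number of digits uniformly first and then draw a number with that many digits. The sketch below follows this idea. The function name `sample_log_uniform`, the 10-digit limit, and the fixed seed are assumptions for illustration, and the exact procedure used for the released dataset or in [29] may differ.

```python
import random


def sample_log_uniform(max_digits, rng):
    """Pick a digit count uniformly, then a uniform integer of that length."""
    digits = rng.randint(1, max_digits)
    low = 0 if digits == 1 else 10 ** (digits - 1)
    high = 10 ** digits - 1
    return rng.randint(low, high)


rng = random.Random(0)  # fixed seed so the sample is reproducible
pairs = [
    (sample_log_uniform(10, rng), sample_log_uniform(10, rng))
    for _ in range(5)
]
for a, b in pairs:
    print(f"{a}+{b}")
```

Sampling uniformly over the integers directly would make about 90% of the operands 10-digit numbers, so short operands would almost never be tested.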

TABLE III: Language Models’ Overall Performance in Numerical Tasks

Model	Model Parameter	5-digit	10-digit
GPT-4	$1760$ B $+$	$99.22$ %	$90.9$ %
GPT-3.5	$175$ B $+$	$97.26$ %	$83.9$ %
Llama2-7b	$7$ B	$22.3$ %	$1.7$ %
Llama2-13b	$13$ B	$17.8$ %	$1.6$ %
Llama2-70b	$70$ B	$57.76$ %	$6.4$ %
Google-PaLM	$110$ B	$73.32$ %	$26.6$ %
Qwen-72b-Chat	$72$ B	$91.32$ %	$60.4$ %
MetaRuleGPT	$30$ M	$100$ %	$100$ %

IV-C Evaluation Metrics

In evaluating the model’s performance, we consider not only the accuracy of the computed results but also the difference ratio between the computed and correct answers. A smaller absolute difference ratio indicates that the model’s output is closer to the true value, suggesting potential for improvement through parameter tuning. Conversely, a difference ratio greater than 1 implies that the model struggles with such problems or faces significant challenges. Assuming the number of correctly predicted quantities is TP and the total number of predictions is N, then accuracy can be defined as: Accuracy $=\frac{TP}{N}\times 100\%$

Suppose the model’s computed result for the $i$-th problem is $y_{i}$, the correct result is $\hat{y_{i}}$, and there are $N$ problems in total. The overall difference ratio is then defined as:
DifferenceRatio $=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_{i}-\hat{y_{i}}}{\max(y_{i},\hat{y_{i}})}\right|$
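Both metrics are straightforward to compute. The sketch below is a minimal implementation of the two definitions, with the model outputs passed as `predictions` and the correct answers as `targets`. The function names are hypothetical, and the code assumes that the larger of each prediction and target pair is nonzero.

```python
def accuracy(predictions, targets):
    """Percentage of predictions that match the target exactly."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)


def difference_ratio(predictions, targets):
    """Mean of |y - y_hat| / max(y, y_hat) over all problems."""
    total = 0.0
    for y, y_hat in zip(predictions, targets):
        denom = max(y, y_hat)  # assumed nonzero in this sketch
        total += abs((y - y_hat) / denom)
    return total / len(targets)


predictions = [100, 50]
targets = [100, 100]
print(accuracy(predictions, targets))          # 50.0
print(difference_ratio(predictions, targets))  # 0.25
```

The two metrics are complementary: accuracy counts only exact matches, while the difference ratio indicates how far the wrong answers are from the correct ones.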

TABLE IV: Interleaved Subtraction

Model	5-digit Error	5-digit Accuracy	10-digit Error	10-digit Accuracy
GPT-4	$0.016$	$98.3$ %	5.368e-7	$96$ %
GPT-3.5	$0.0033$	$95.2$ %	$0.037$	$91$ %
Llama2-7b	$0.64$	$2.3$ %	$0.92$	$0$ %
Llama2-13b	$0.52$	$21.9$ %	$0.69$	$2$ %
Llama2-70b	$0.061$	$76.1$ %	$0.79$	$2$ %
Google-PaLM	$0.0076$	$95.9$ %	$0.54$	$19$ %
Qwen-72b-Chat	$0.0092$	$93.9$ %	$0.0027$	$74.5$ %
MetaRuleGPT	$0.0$	$100$ %	$0.0$	$100$ %

TABLE V: Reverse Magnitude Subtraction

Model	5-digit Error	5-digit Accuracy	10-digit Error	10-digit Accuracy
GPT-4	$0.027$	$97.8$ %	1.3e-8	$96.5$ %
GPT-3.5	$0.0033$	$99.4$ %	8.3e-4	$88.5$ %
Llama2-7b	$0.64$	$20.8$ %	$2.1$	$0.0$ %
Llama2-13b	$0.52$	$4$ %	$1.5$	$0.0$ %
Llama2-70b	$0.061$	$50.2$ %	4.6e-6	$0.5$ %
Google-PaLM	$0.0076$	$43.6$ %	$1.0$	$0.0$ %
Qwen-72b-Chat	$0.0092$	$86.4$ %	$0.065$	$3.5$ %
MetaRuleGPT	$0.0$	$100$ %	$0.0$	$100$ %

IV-D Deep Numerical Optimization Experiments on Language Models

To test our model’s mathematical reasoning and generalization capabilities, we conducted comparisons with well-known language models, including the currently leading ChatGPT-4.0, ChatGPT-3.5, Alibaba’s Qwen, Google’s PaLM, Llama2, and Goat. Through such comparisons, we can comprehensively assess the performance differences between models and evaluate MetaRuleGPT’s capabilities in mathematical reasoning tasks. A comprehensive comparison is shown in Fig. 6 and Table III.

Using the test datasets described above, we queried each of the aforementioned large language models and recorded their outputs for comparison. We conducted a series of detailed experiments and evaluations, and the results are shown in Tables IV, V, VI, VII, and IX.

Finally, to compare on a public dataset, we selected subsets of GSM8K [30] and simplified the natural-language part into mathematical formulas. The results are shown in Table VIII.

IV-E Language Model-Driven Vector Cross Product Calculation Experiment

To demonstrate our model’s capability in handling complex logical problems, we selected vector cross-product calculation, a challenging mathematical task, as a test case. This test not only verifies the model’s computational accuracy but also compares its performance with that of leading large language models. Table X presents the detailed accuracy comparison of various models on the vector cross-product calculation dataset.

V Results and Discussion

V-A Test Data Results Analysis

TABLE VI: Randomized Addition Procedure

Model	5-digit Error	5-digit Accuracy	10-digit Error	10-digit Accuracy
GPT-4	$0.0$	$100$ %	$0.092$	$85.5$ %
GPT-3.5	1.6e-5	$99$ %	$0.046$	$72$ %
Llama2-7b	$0.7434$	$49.5$ %	$6.6$	$0.5$ %
Llama2-13b	$0.5075$	$28.5$ %	$0.47$	$2$ %
Llama2-70b	$0.0165$	$84$ %	$0.64$	$11$ %
Google-PaLM	$0.0017$	$94$ %	$0.34$	$39.5$ %
Qwen-72b-Chat	$0.0005$	$97$ %	$0.0082$	$75$ %
MetaRuleGPT	$0.0$	$100$ %	$0.0$	$100$ %

TABLE VII: Randomized Subtraction Procedure

Model	5-digit Error	5-digit Accuracy	10-digit Error	10-digit Accuracy
GPT-4	$0.0$	$100$ %	$0.019$	$78.5$ %
GPT-3.5	4.9e-5	$95.5$ %	$0.052$	$76.5$ %
Llama2-7b	$1.4$	$25.5$ %	$2.1$	$3$ %
Llama2-13b	$0.49$	$31$ %	$0.67$	$1$ %
Llama2-70b	$0.085$	$73.5$ %	$1.5$	$12.5$ %
Google-PaLM	$0.15$	$80.5$ %	$0.68$	$47$ %
Qwen-72b-Chat	2.0e-4	$93.5$ %	$0.0017$	$81.5$ %
MetaRuleGPT	$0.0$	$100$ %	$0.063$	$100$ %

TABLE VIII: Simplified gsm8k Table

Model	Accuracy
GPT-4	$100$ %
GPT-3.5	$99$ %
llama2-7b	-
llama2-13b	-
llama2-70b	$98$ %
Google-PaLM	$100$ %
Qwen-72b-Chat	$100$ %
MetaRuleGPT	$100$ %

V-A1 Numerical calculation results

To assess our model’s performance in solving general numerical problems, we generated a large amount of experimental data with random numbers using Python. The preliminary results show that MetaRuleGPT and most of the other tested language models (all except Llama2-7b and Llama2-13b in Tables VI and VII) achieved accuracy rates exceeding $70\%$ in low-digit addition and subtraction operations, demonstrating high computational precision. However, as the number of digits increased, the performance of most language models declined significantly. Except for ChatGPT, the other models often made mistakes in high-digit calculations because they do not deeply grasp computational rules, and they lost almost all of their computational capability.

Remarkably, MetaRuleGPT maintained $100\%$ accuracy even when facing high-digit random addition tasks. Although it faced certain challenges in high-digit random subtraction tasks, our model still showed the highest accuracy among all tested language models. This achievement not only highlights MetaRuleGPT’s strong performance in solving complex numerical problems but also proves its generalization ability.

Table IV presents the test results of various models on interleaved subtraction. Table V shows the results on reverse magnitude subtraction. Tables VI and VII show the results on randomly generated addition and subtraction, respectively. Table IX displays the results on perfect decadic addition.

V-A2 Vector Cross Product Results

From the data in Table X, it is evident that the Llama2 models with 7B and 13B parameters were incapable of performing vector cross-product calculations, while Llama2-70B, the largest model of the Llama2 series, could perform cross-product calculations but with $0\%$ accuracy. Even the state-of-the-art ChatGPT achieved an accuracy rate below $50\%$ without the aid of external tools. In contrast, MetaRuleGPT calculated vector cross products in three-dimensional space with $100\%$ accuracy, confirming the effectiveness of enhancing model capabilities by combining different rules. By comprehensively learning basic operational rules such as addition, subtraction, multiplication, and cross product, MetaRuleGPT invoked these rules precisely and output accurate calculation results. Table X provides detailed accuracy comparison results of different models on the dataset for vector cross-product computation.

TABLE IX: Perfect Decadic Addition

Model	5-digit Error	5-digit Accuracy	10-digit Error	10-digit Accuracy
GPT-4	$0.0$	$100$ %	2.9e-9	$98$ %
GPT-3.5	4.2e-5	$97.2$ %	2.0e-4	$91.5$ %
Llama2-7b	$15$	$13.4$ %	$30$	$5$ %
Llama2-13b	$0.16$	$3.6$ %	$10$	$3$ %
Llama2-70b	$0.14$	$5$ %	$1.8$	$6$ %
Google-PaLM	$0.89$	$52.6$ %	$24$	$27.5$ %
Qwen-72b-Chat	$0.27$	$85.8$ %	$0.21$	$67.5$ %
MetaRuleGPT	$0.0$	$100$ %	$0.0$	$100$ %

TABLE X: Vector Cross Product Table

Vector Compute	Accuracy
GPT-4	$17$ %
GPT-3.5	$5.5$ %
llama2-7b	-
llama2-13b	-
llama2-70b	$0$ %
Google-PaLM	$0$ %
Qwen-72b-Chat	$23$ %
MetaRuleGPT	$100$ %

Moreover, by training rules for two different types of tasks within the same pre-trained model, MetaRuleGPT demonstrated multi-task generalization ability. This indicates that our model is not only adaptable to various task scenarios but can also identify and apply common rules among these tasks, significantly enhancing learning efficiency. These results further showcase MetaRuleGPT’s strong performance and flexibility.

V-B Discussion

Although existing language models have demonstrated powerful capabilities, they still face challenges in terms of controllability. In particular, most models struggle to precisely answer questions within a controlled range, often resulting in significant deviations, which is a crucial issue that current language models need to address. Conversely, MetaRuleGPT’s rule-based execution ensures relative reliability and better controllability.

Experiments show that existing language models struggle with high-digit calculations and complex computational tasks due to limitations in understanding and the difficulty of learning a unified representation for text and numbers. MetaRuleGPT addresses these challenges by precisely completing complex tasks and demonstrating the generalization potential of language models in numerical calculations through rule learning.

VI Conclusion

This study explores the rule-following capabilities of language models, focusing on the combinatorial skills and generalization abilities that humans display in problem-solving. We introduce MetaRuleGPT, a Transformer-based language model that utilizes an iterative strategy and learns from a series of compound rules and sub-rules. With only 30 million parameters, MetaRuleGPT demonstrates high accuracy in handling high-digit calculations and complex vector cross-product operations, surpassing current mainstream large language models. Our findings highlight the importance of incorporating rule-based learning in language models to enhance their numerical reasoning abilities and generalization skills. MetaRuleGPT’s success in solving complex mathematical problems with relatively few parameters showcases the effectiveness of this approach and paves the way for future research in this direction.

Limitations

This research raises several issues worthy of further exploration, including:

1. Although our model has shown certain generalization and understanding abilities after rule learning, it is limited by computational resources, and the variety of problems it can handle is relatively limited. Expanding the model’s parameter size and training with more diverse rule datasets could enable MetaRuleGPT to handle a broader range of logical tasks.

2. The controllability of MetaRuleGPT may vary with the increase in required tasks, and we plan to add more task rules in future model training to further evaluate and improve the model’s controllability. Moreover, we aim to enhance MetaRuleGPT’s controllability when handling tasks beyond its current capabilities. For instance, while MetaRuleGPT excels in numerical computation tasks, it may produce unpredictable results when attempting function integration problems, leading to significant errors. Addressing this challenge is one of our current focuses, and we plan to introduce more rule data in future optimizations to improve the model’s overall controllability, enabling greater stability and accuracy across a wider range of applications.

3. MetaRuleGPT currently cannot automatically handle untrained generalization forms or novel concepts beyond the meta-learning distribution, which limits its ability to tackle entirely new problems. Achieving human-like systematic generalization by leveraging real-world training experiences remains an open question and a direction for future research.

References

[1] Z. W. Lim, K. Pushpanathan, S. M. E. Yew, Y. Lai, C.-H. Sun, J. S. H. Lam, D. Z. Chen, J. H. L. Goh, M. C. J. Tan, B. Sheng et al., “Benchmarking large language models’ performances for myopia care: a comparative analysis of chatgpt-3.5, chatgpt-4.0, and google bard,” EBioMedicine, vol. 95, 2023.
[2] M. Völske, M. Potthast, S. Syed, and B. Stein, “TL;DR: Mining Reddit to learn automatic summarization,” in Proceedings of the Workshop on New Frontiers in Summarization, L. Wang, J. C. K. Cheung, G. Carenini, and F. Liu, Eds. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 59–63. [Online]. Available: https://aclanthology.org/W17-4508
[3] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” 2015. [Online]. Available: https://arxiv.org/abs/1506.03340
[4] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” 2018. [Online]. Available: https://arxiv.org/abs/1808.08745
[5] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “Piqa: Reasoning about physical commonsense in natural language,” 2019. [Online]. Available: https://arxiv.org/abs/1911.11641
[6] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” 2021. [Online]. Available: https://arxiv.org/abs/2009.03300
[7] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu, “Metamath: Bootstrap your own mathematical questions for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2309.12284
[8] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct,” 2023. [Online]. Available: https://arxiv.org/abs/2308.09583
[9] P. Wang, L. Li, Z. Shao, R. X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui, “Math-shepherd: Verify and reinforce llms step-by-step without human annotations,” 2024. [Online]. Available: https://arxiv.org/abs/2312.08935
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2201.11903
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[12] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021. [Online]. Available: https://arxiv.org/abs/2110.14168
[13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[14] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2205.10625
[15] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, “Complexity-based prompting for multi-step reasoning,” 2023. [Online]. Available: https://arxiv.org/abs/2210.00720
[16] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” 2023. [Online]. Available: https://arxiv.org/abs/2203.11171
[17] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” 2022. [Online]. Available: https://arxiv.org/abs/2109.01652
[18] J. Vanschoren, “Meta-learning: A survey,” 2018. [Online]. Available: https://arxiv.org/abs/1810.03548
[19] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” 2017. [Online]. Available: https://arxiv.org/abs/1703.03400
[20] G. W. Bassel, E. Glaab, J. Marquez, M. J. Holdsworth, and J. Bacardit, “Functional network construction in Arabidopsis using rule-based machine learning on large-scale data sets,” The Plant Cell, vol. 23, no. 9, pp. 3101–3116, Sep. 2011. [Online]. Available: https://doi.org/10.1105/tpc.111.088153
[21] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “Ernie: Enhanced language representation with informative entities,” arXiv preprint arXiv:1905.07129, 2019.
[22] B. M. Lake and M. Baroni, “Human-like systematic generalization through a meta-learning neural network,” Nature, vol. 623, no. 7985, pp. 115–121, 2023.
[23] OpenAI, “OpenAI’s ChatGPT: A Revolution in Language AI,” https://openai.com/blog/chat-gpt/, Sep. 2021.
[24] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[25] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
[26] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[27] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[28] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama outperforms gpt-4 on arithmetic tasks,” 2023. [Online]. Available: https://arxiv.org/abs/2305.14201
[29] S. Lee and G. Kim, “Recursion of thought: A divide-and-conquer approach to multi-context reasoning with language models,” 2023. [Online]. Available: https://arxiv.org/abs/2306.06891
[30] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
