Analyzing the Nuances of Transformers’ Polynomial Simplification Abilities

Vishesh Agarwal
Microsoft Vancouver
725 Granville St, Vancouver, BC, Canada
{visagarwal}@microsoft.com
&Somak Aditya & Navin Goyal
Microsoft Research India
9 Vigyan, Lavelle Road, KA, India
{t-soadit,navingo}@microsoft.com

Abstract

Symbolic Mathematical tasks such as integration often require multiple well-defined steps and understanding of sub-tasks to reach a solution. To understand Transformers’ abilities in such tasks in a fine-grained manner, we deviate from traditional end-to-end settings, and explore a step-wise polynomial simplification task. Polynomials can be written in a simple normal form as a sum of monomials which are ordered in a lexicographic order. For a polynomial which is not necessarily in this normal form, a sequence of simplification steps is applied to reach the fully simplified (i.e., in the normal form) polynomial. We propose a synthetic Polynomial dataset generation algorithm that generates polynomials with unique proof steps. Through varying coefficient configurations, input representation, proof granularity, and extensive hyper-parameter tuning, we observe that Transformers consistently struggle with numeric multiplication. We explore two ways to mitigate this: Curriculum Learning and a Symbolic Calculator approach (where the numeric operations are offloaded to a calculator). Both approaches provide significant gains over the vanilla Transformers-based baseline.

1 Introduction

Recently Lample & Charton (2019) showed that for the tasks of symbolic integration and solving differential equations a large number of synthetic end-to-end examples can be generated using symbolic systems. In these tasks, the authors show that Transformer networks can be trained to produce the final solution from an input integral (or differential equation) in a single step. This points to the exciting possibility of using deep neural nets to learn end-to-end theorem provers, and can be beneficial for formal mathematics (Szegedy, 2020). However, the setup combines multiple reasoning steps in a single shot: e.g., symbolic integration is a composite task and it can be non-trivial to produce intermediate steps even if the final answer is known. As the system in Lample & Charton (2019) is simply trained to output the top solution(s), it is unclear what internal mechanisms enable these models to solve these problems. An earlier work by Piotrowski et al. (2019) showed similar results for certain symbolic manipulation tasks and their work shares the same limitation.

In this paper we ask if instead of only producing the end-result of symbolic manipulation or integration, can the model produce the full sequence of intermediate steps as a human-readable proof as well. Inspired by Piotrowski et al. (2019), we explore a novel but simpler setting of polynomial simplification. In this task, we begin with a polynomial which is a sum of products of factors, where each factor is again a sum of monomials (including constants). For example,

\displaystyle P_{0}

\displaystyle=(2*x_{2}^{2})*\overbrace{(\underbrace{3*x_{2}^{1}}_{\text{term}}+4)}^{\text{factor}}+\overbrace{(5*x_{1}^{2}+x_{1}^{1}*x_{2}^{1})*(3*x_{1}^{1})*(2)}^{\text{product}}.

/* Initial */

(1)

By the simplified polynomial we mean a polynomial that is written as a sum of monomials arranged in the lexicographic order. We transform a polynomial given in the above form into the simplified form in unique steps as illustrated in the example below: First, each term in a factor is simplified. Once all factors are simplified (facstep); then within a product, all factors are multiplied (mulstep). Lastly, simplified products are summed (sumstep).

$\displaystyle P_{0}$	$\displaystyle=\underline{(2x_{2}^{2})(3x_{2}^{1}+4)}+(5x_{1}^{2}+x_{1}^{1}x_{2}^{1})(3x_{1}^{1})(2),$	/* facstep */	(2)
	$\displaystyle=(2x_{2}^{2})(3x_{2}+4)+\underline{(5x_{1}^{2}+x_{1}^{1}x_{2}^{1})(3x_{1}^{1})(2)},$	$\displaystyle(P_{1}),\textsc{/* facstep */}$	(3)
	$\displaystyle=\underline{(2x_{2}^{2})(3x_{2}+4)}+(5x_{1}^{2}+x_{1}x_{2})(3x_{1})(2),$	$\displaystyle(P_{2}),\textsc{/* mulstep */}$	(4)
	$\displaystyle=(6x_{2}^{3}+8x_{2}^{2})+\underline{(5x_{1}^{2}+x_{1}x_{2})(3x_{1})*(2)},$	$\displaystyle(P_{3}),\textsc{/* mulstep */}$	(5)
	$\displaystyle=\underline{(6x_{2}^{3}+8x_{2}^{2})+(30x_{1}^{3}+6x_{1}^{2}*x_{2})},$	$\displaystyle(P_{4}),\textsc{/* sumstep */}$	(6)
	$\displaystyle=30x_{1}^{3}+6x_{2}^{3}+6x_{1}^{2}x_{2}+8*x_{2}^{2}.$	$\displaystyle(P_{5}),\textsc{/* Endpoint */}.$	(7)

In addition to the above setting (termed Coarse), we define a Finer setting, where a facstep involves simplification of a single term, a mulstep involves multiplications of only two factors at once, and each sumstep involves addition of only two products at once. Piotrowski et al. (2019) explores the task of learning symbolic re-write of an entire expression. In contrast, in our setting, for step-wise prediction, at each step the system needs to find the candidate sub-expression and a relevant simplification type to perform the simplification. This setup resembles the traditional ATP setup where a system needs to learn and execute symbolic steps to reach a final solution. But it is much simpler, as for each step only one unique simplification is applicable. By the proof for the simplification of the initial polynomial ( $P_{0}$ ) we mean the sequence of simplification steps (leading to $P_{1}$ to $P_{5}$ in the example above). A model trained on step-wise prediction task, can be used to generate a full proof: we start with an initial polynomial, and recursively feed the model output to itself, till it generates the final simplified polynomial (in the normal form). We say that a proof is correct when all steps are correct.

Summary of our contributions.

(1) Definition of a step-wise polynomial simplification task and the dataset generation algorithm, (2) identification of bottlenecks in Transformers’ math abilities validated with extensive hyperparameter search, (3) a neuro-symbolic experiment (to further understand one of these bottlenecks) where the arithmetic calculations are outsourced to a calculator, which results in significant gains across most configurations, and (4) curriculum learning for 1-variable setting, where we observe significant gains (Table 18 appendix H).

2 Polynomial Simplification Dataset & Task Setup

We use the the Sympy library to generate the symbolic polynomials and simplification steps. To have a fine-grained control over the generated polynomials and the proof steps, we consider starting polynomials which are sums of products (e.g. $P_{0}$ in Eqn. 1).¹¹1Restriction over the form of the polynomial helps us generate unique proofs, which are easier to evaluate. Detailed sampling algorithm in appendix A. We randomly sample starting point polynomials. For both coarse and finer configurations, we build the proof steps as follows: (1) first we do a sequence of facsteps where terms get collected within a factor (such as $2x+3x$ to $5x$ , $x^{1}$ and $1x$ become $x$ ); (2) then a sequence of mulsteps are performed where simplified factors are multiplied out; and (3) lastly, in sumstep simplified products are added together. The sequence of simplification steps till the endpoint constitute a full proof. The simplified endpoint polynomials are in normal form, similar to $P_{5}$ in Eqn. 7.

Dataset Configurations.

We vary dataset configurations along the following dimensions: (1) The number of variables in the polynomial is either 1 or 2; (2) coefficients size: the maximum coefficients in the polynomials, products and factors resp., are varied from $\{60,20,5\}$ (Small), to $\{120,40,8\}$ (Medium) to $\{300,100,10\}$ (Large) (Default is $\{120,40,8\}$ ); (3) the maximum degree in polynomial and a factor has two configurations: $\{6,3\}$ (Default), and $\{12,5\}$ (Medium Degree); (4) the maximum number of terms in a simplified product and a factor has two configurations: $\{8,3\}$ (Default), and $\{20,4\}$ (Medium Terms). For the latter, we also set maximum products in a sum and maximum factors in a product as 5 and 4 respectively. For variation along a dimension, defaults for other dimension is used, e.g., default degree and terms for Small Coeff. Lastly, we try a very large configuration (No Backtrack) where maximum coefficients in polynomial, product and factor are $\{10125,3375,5\}$ , maximum degree in polynomial and factor are $\{9,3\}$ ; and maximum terms in a product is $27$ . This is a configuration, where no sampled factor, or product is ever rejected for violating any higher-level constraint; thus capturing the effect of all constraints at once. As input representations, we vary between the prefix and infix traversals of the abstract syntax tree of the polynomial input as sequences.

Tasks and Metrics.

We identify two tasks : (1) step-wise prediction: where an input polynomial is provided and the task is to perform the next proof step, and (2) endpoint prediction: where given a polynomial, the task is to predict the fully simplified polynomial in one shot. For a fair comparison with the endpoint prediction task, we evaluate the stepwise models on their full proof accuracy. We report the percentage of correct proofs (full proof acc) and percentage of correct steps (stepwise acc) using the model’s top prediction (beam size 1) for every proof step . For the step-wise prediction task the full proof accuracy is the percentage of proofs where all individual proof steps are accurate.²²2This is done using teacher forcing. We have attempted recursive proof generation as well, where the output from the decoder is fed to the encoder in the next step. With recursive generation, if in any step the model went wrong, it did not recover after that. Hence, proof accuracy with teacher-forcing is a lower bound on that. We also report the following: (1) error percentages grouped by each different types of steps (facstep, mulstep, and sumstep), (2) calibration scores of the systems based on a threshold. To compute the accuracy for an example (in both the tasks), we use the simplify method of Sympy library and check symbolically whether the difference between the predicted expression and the ground-truth expression is equal to zero. We also calibrate the model over how confident it is of its predictions, by taking the natural log ratio of probabilities of top 2 outputs with beam 5. Whenever the ratio is greater than a threshold (usually 5), we mark that output as being sure. Correspondingly, we report the sure rate (percentage of outputs marked sure), precision, recall and F-1 score for calibration.

2.1 Experimental Results

Our Model.

We train a seq2seq network to predict the next proof step provided a polynomial as a sequence. We train a Transformer network (Vaswani et al., 2017) architecture using Adam optimizer (Kingma & Ba, 2014). To compare across configurations, we use a default hyperparameter setting (denoted by Transformers-S): 4 attention heads, 4 enc/dec layers, embedding size 256, learning rate $10^{-4}$ , batch size $32$ . For larger coefficient configurations, we do an exhaustive hyperparameter search varying Transformers size (S and L)³³3S = 4 enc/dec layers, 4 H; L = 6 enc/dec layers, 8H. Learning rates were at first: 1e-2, 5e-3, 5e-4. But,
lr $>5*10^{-4}$ resulted in near zero validation scores. lr $10^{-5}$ convergence took a long time., embedding size (256, 512), learning rate (1e-4, 5e-4, 1e-5), dropout (0, 0.5), batch-size (32, 64, 128), input representation (prefix/infix) and proof granularity (coarse/fine). During training, we synthetically generate each batch of equations. To avoid collisions between train and test sets, we first use a fixed seed to generate the test and the validation sets of polynomial simplification full proofs and collect the simplified end-points. While sampling training batches, we make sure that the simplified versions of the input polynomial in the training batches, do not collide with any endpoints in the the test and validation set.⁴⁴4Authors in Piotrowski et al. (2019) show that the probability of such collisions in the generated integration dataset by Lample & Charton (2019) to be quite high, and urge to report the test accuracy by accounting for such collisions explicitly. During inference, we use beam-search with different beam widths (beam 1 and 5) to decode the expressions. For our results, beam width 1 is used for proof accuracy. Calibration results are produced using beam 5 decoding. During decoding, if any malformed (prefix or infix) expressions are generated, we report the percentage of such expressions.⁵⁵5Similar to Lample & Charton (2019), we find that the percentage of malformed outputs was very low ( $<$ 0.5%). So we did not explicitly correct for it.

For the Transformers-S setting, we report the results for one and two variables for all configurations in Tables 1 and 6 (appendix D). In Table 1, we observe that Coarse proof-steps with Prefix representation provides the best full proof accuracy for four out of six configurations (especially for larger coefficient sizes). From the calibration results, we see that the winning combinations often provide the highest calibration F-1 score (more prominent for 2 variables), indicating lesser ambiguity in the decision made. As coefficient sizes grow from Small Coeff to No Backtrack, for 1 variable, the endpoint accuracy is only slightly higher ( $1$ to $2\%$ ) than the full proof accuracy. For Medium Terms and Medium Degree, the Endpoint accuracy shows a $3.6\%$ and $13\%$ improvement respectively. For 2 variables, endpoint task accuracy is much larger in most cases. From step-wise error analysis (Appendix Tables 8 & 9), shows that for 1 variable, more than 80% of the model errors occur in the multiplication step. In most cases, both factor simplification and addition cause close to 5% of the model errors each.

Polynomial Config

Proof/Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

Small Coeff

Fine/Infix

4.8M

98.9

99.79

94.46

92.38

100

97.8

0.99

Med Coeff

Coarse/Prefix

6.1M

95.87

5.3M

93.6

98.58

86.6

88.47

82.83

99.88

95.54

0.98

Large Coeff

Coarse/Prefix

6.5M

85.87

3.5M

83.5

96.25

80.6

83.3

99.91

92.97

0.96

No BT

Coarse/Prefix

6.6M

78.87

5.6M

79.7

95.38

81.93

85.57

72.2

100

88.12

0.94

Med Deg

Coarse/Infix

9.2M

96.4

4.9M

92.8

98.26

87.18

88.96

81.12

100

93.05

0.96

Med Terms

Coarse/Prefix

89.8

4.3M

76.3

95.78

87.8

91.17

81.3

100

92.6

0.96

Table 1: Transformers-S results for 1 variable. Only highest proof accuracy combinations are shown for each coefficient configuration. Full results in Appendix.

Polynomial Config

Proof/Input Format

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

Large Coeff

h/infix

3.6M

84.6

96.79

96.5

98.02

91.62

99.93

94.88

0.97

cal/infix

3.2M

95.7

99.16

97.98

98.72

95.58

99.94

97.49

0.99

No Backtrack

h/infix

3.2M

82.6

96.85

95.98

97.86

88.42

99.98

92.1

0.96

cal/infix

3.8M

87.2

97.94

97.26

98.56

94.18

99.94

96.77

0.98

Medium Degree

h/infix

3.3M

90.8

97.81

96.66

98.14

90.34

99.93

93.4

0.97

cal/infix

2.8M

90.5

98.08

97.92

98.94

92.86

99.96

94.79

0.97

Medium Terms

h/infix

3.5M

79.5

96.49

96.16

97.92

91.19

100

94.82

0.97

cal/infix

3.7M

89.5

98.45

97.88

99.08

94.88

99.92

96.85

0.98

Table 2: 1 variable, hyperparmeter-tuning and symbolic calculator experiments. h/* rows denote the performance of models with best validation scores over all explored hyperparameter configurations (including input representation variations). cal/* provides the same for the calculator setting for the best models after hyperparameter search. Full table in Appendix.

Symbolic Calculator.

As 80% of the errors occurred in multiplication step, we separately tested the Transformer’s ability to do arithmetic, by creating datasets involving multiplication and addition of 4-digit and 9-digit numbers. While the models quickly achieved an accuracy of 99% for addition; for multiplication, they could not go beyond even 1% after seeing 2M examples. Hence, we devise a setting where polynomial simplification steps only involve symbolic addition and multiplication, without any arithmetic manipulation. For example, for given expressions $(3*x_{1})*(4*x_{1})$ , the model is trained to output $[3*4]*x_{1}^{2}$ , where “ $[\cdot]$ ” signifies that a calculator needs to simplify the inner expression. In Tab. 2, we report results for both the standard and the calculator setting. Each row denotes the test-set results for the model with best validation score⁶⁶6Full table and best hyperparameter settings per row in Appendix Table 10., for a fixed dataset configuration and input representation. As shown, in many configurations there is a significant increase in proof accuracy (11% in Large Coeff and Medium Terms) for both prefix and infix representation.

Curriculum Learning.

For this task, the sub-tasks (addition, multiplication) and their dependencies are well-defined. To exploit such dependencies, we explore different curriculum strategies based on the Mastering-Rate-based (MR) curriculum learning algorithm by Willems et al. (2020). In this work, authors define the task dependencies as a graph, and a sub-task is only sampled from (or learnt) when its parent tasks are mastered. From our experiments (detailed in Appendix), we observe that in 1-variable setting, as coefficient size grows from Small, Medium, Large to No Backtrack - the improvements in full proof accuracy steadily increase from $1\%$ to $10.84\%$ .

Conclusion and Results Summary.

We define the polynomial simplification task and present the synthetic dataset generation procedure. A default Transformers settings show that for smaller configurations, Transformers learn to generate whole proofs with very high accuracy. For large configurations, even with exhaustive hyper-parameter tuning, even a larger Transformer model suffers. We identify that arithmetic multiplication is a consistent bottleneck for Transformers. Through a neuro-symbolic model, where numeric operations are outsourced to a calculator, we observe high gains on full proof accuracy. Lastly, we observe carefully designed curricula can also boost full proof accuracy up to 10% for large coefficient sizes. These results indicate some potential ingredients that could be useful for designing very high accuracy models.

References

Graves et al. (2017) Alex Graves, Marc G Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320, 2017.
Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Lample & Charton (2019) Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2019.
Matiisen et al. (2019) Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. IEEE transactions on neural networks and learning systems, 2019.
Piotrowski et al. (2019) Bartosz Piotrowski, Josef Urban, Chad E. Brown, and Cezary Kaliszyk. Can neural networks learn symbolic rewriting?, 2019.
Szegedy (2020) Christian Szegedy. A promising path towards autoformalization and general artificial intelligence. In Christoph Benzmüller and Bruce Miller (eds.), Intelligent Computer Mathematics, pp. 3–20, Cham, 2020. Springer International Publishing. ISBN 978-3-030-53518-6.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
Willems et al. (2020) Lucas Willems, Salem Lahlou, and Yoshua Bengio. Mastering rate based curriculum learning, 2020.

Appendix A Polynomial Simplification Dataset Creation and The Sampling Algorithm

We proceed similarly to Lample & Charton (2019) to generate the symbolic polynomials and simplified steps synthetically using the Sympy library of Python. To have a fine-grained control over the generated polynomials and well-defined proof steps, we consider polynomials which are sums of products⁷⁷7The generation algorithm in Lample & Charton (2019) may generate nested sums and products. For such polynomials, an unique proof sequence is hard to define which makes whole proofs harder to evaluate. Our restriction over the form of the polynomial helps us generate unique proofs, which are easier to evaluate.. We also note that symbolic generation using the Sympy library lets us ensure correctness of each generated expressions and validity of each steps.

A.1 Notations

We start with the set of variables $x_{\mathcal{P}}=\{x_{1},\ldots,x_{\mathrm{nvar}}\}$ . We represent the starting point polynomial $\mathcal{P}_{0}$ in $x_{\mathcal{P}}$ as the sum of products of factors:

	$\displaystyle\mathcal{P}_{0}$	$\displaystyle=P_{1}+P_{2}+\ldots+P_{\mathrm{nprod}},$		(8)
	$\displaystyle P_{i}$	$\displaystyle=\prod_{j=1}^{\mathrm{nfac}_{i}}f_{ij},$		(8)

where each factor ( $f_{ij}$ ) has the form $f=\sum_{k}({a_{k}*\prod_{l}x_{kl}^{d_{kl}}})$ , where $~{}x_{kl}\in x_{\mathcal{P}}$ (dropping $i,j$ for clarity). Here coefficients $a_{k}\in\mathbb{N}^{+}$ , and powers of the variables $d_{kl}\in\mathbb{N}$ . $\mathrm{nprod}$ is the number of products and $\mathrm{nfac}_{i}$ denotes the number of factors in $P_{i}$ .
We denote the set of factors as $f_{\mathcal{P}}=\{f_{ij}|\exists i,P_{i}=\prod_{j=1}^{\mathrm{nfac}_{i}}f_{ij}\}$ . The simplified endpoint polynomial is of the form $\mathcal{\hat{P}}=\sum_{m=1}^{q}\hat{t}_{m}$ , where $\hat{t}_{m}=\hat{a}_{m}*\prod_{n}{x_{n}}^{d_{mn}}$ , where $x_{n}\in x_{\mathcal{P}}$ . We use the symbol $\hat{P}_{i}$ to denote the simplified form of $P_{i}$ . The functions $\mathrm{terms}(),\mathrm{vars}(),\mathrm{coeffs}()$ returns a list of terms, variables, coefficients in the input expression. Our sampling algorithm guarantees that the generated polynomial and its simplified endpoint abides by constraints on number of terms, products, factors and variables; limit on degree and coefficient sizes. An example is $\mathrm{nprod}\in\{2,\ldots,\mathrm{maxP_{\mathcal{P}}}\}$ . We provide the full list of constraints and notations in Table 3.

Term

Constraints

#Products

#Factors in

P_{i}

#Terms in

f_{ij}

#Terms in

\hat{P}_{i}

\mathrm{nprod}\in\{2,\ldots,\mathrm{maxP_{\mathcal{P}}}\}

\mathrm{nfac}_{i}\in\{2,\ldots,\mathrm{maxf_{P}}\},\forall i\in\{1,\ldots,\mathrm{nprod}\}

|\mathrm{terms}(f_{i})|\in\{1,\ldots,\mathrm{maxT_{f}}\},\forall f_{i}j\in f_{\mathcal{P}}

|\mathrm{terms}(\hat{P_{i}})|\leq\mathrm{maxT_{P}}\forall P_{i}\in\mathcal{P}_{0}

Degree

Constraints

#Degree in

\hat{\mathcal{P}}

#Degree in

f_{ij}

\sum d_{mn}\leq\mathrm{D_{\mathcal{P}}},~{}\forall m~{}\hat{t_{m}}\in\mathrm{terms}(\hat{P}),\forall n~{}x_{n}\in\mathrm{vars}(\hat{t_{m}})

\sum d_{kl}\leq\mathrm{D_{f}},\forall k~{}\mathrm{terms}(f_{ij}),\forall f_{ij}\in f_{\mathcal{P}}

Variable

Constraints

#Variables in

\mathcal{P}_{0}

#Variables in

P_{i}

#Variables in

f_{i}

|x_{\mathcal{P}}|\leq\mathrm{V_{\mathcal{P}}}

|\mathrm{vars(P_{i})}|\leq\mathrm{V_{P}},\forall P_{i}\in\mathcal{P}_{0}

|\mathrm{vars(f_{ij})}|\leq\mathrm{V_{f}},\forall f_{j}\in f_{\mathcal{P}}

Coefficient

Constraints

Coeff in

\hat{\mathcal{P}}

Coeff in

\hat{P}_{i}

Coeff in

f_{i}

\hat{a_{j}}\leq\mathrm{C_{\mathcal{P}},}\forall\hat{a_{j}}\in\mathrm{coeffs}(\hat{P})

\hat{a_{ij}}\leq\mathrm{C_{P}},\forall a~{}\mathrm{coeffs}(\hat{P_{i}}),\forall P_{i}\in\mathcal{P}_{0}

a_{k}\leq\mathrm{C_{f}},\forall a~{}\mathrm{coeffs}(f_{ij}),\forall f_{ij}\in f_{\mathcal{P}}

Table 3: List of notations, and corresponding constraints that a generated polynomial satisfies.

A.2 Building a Polynomial Proof

Here, we briefly describe the starting polynomial generation process; detailed algorithm is in the appendix. Any randomly sampled polynomial (represented as a sum of products) can be included as a starting point in the dataset as long as the polynomial respects certain configuration parameters (in Appendix Table 3). This is unlike Lample & Charton (2019), where many randomly generated integrals (or differential equations) might not have a solution. Hence, we randomly sample the constraint parameters in a top-down manner; and then construct terms, factors and products in a bottom-up manner using the parameters. We first sample the following 1) a set of participating variables ( $x_{\mathcal{P}}$ ), 2) maximum degree for any monomial in the simplified polynomial ( $\mathrm{mdeg}$ ), and 3) the number of products in the starting polynomial ( $\mathrm{nprod}$ ). We then call the algorithm buildProduct (Algorithm 1 in appendix) to create $\mathrm{nprod}$ individual products.

Input:

x_{\mathcal{P}}

, mdeg

Constraints : nvars_prod, max_coeff_prod, max_fac_prod, max_terms_prod

Output: A list of factors

F_{seq}

1 Sample

nvar

\in\{\texttt{num\_vars\_fac},\ldots,\texttt{nvars\_prod}\}

nvar=min(|x_{\mathcal{P}}|,nvar)

Sample

nvar

variables from

x_{\mathcal{P}}

x_{\mathcal{P}_{i}}

// Variable set for this product

Sample

nfac

\in\{2,\ldots,\texttt{max\_fac\_prod}\}

// #Factors for this product

/* Get maximum degree, terms and coefficient available */

rdegree

mdeg

rterms

= max_terms_prod,

rcoeff

= max_coeff_prod

cprod

= 1

// Keeping track of product built till now

F_{seq}=[~{}]

5 for $j\leftarrow 1$ to $nfac$ do

f_{j}=\texttt{buildFactor}(x_{\mathcal{P}_{i}},rdegree,rterms,rcoeff)

/* Update degree, terms and coefficient for next factor */

cprod=cprod*f_{j}

rdegree=rdegree-degree(f_{j})

rterms

= max_terms_prod/

|terms(cprod)|

rcoeff

= max_coeff_prod/

max(coeffs(cprod))

11 Append

f_{j}

F_{seq}

12 if $rdegree==0$ then

13 break

15 end if

17 end for

Shuffle

F_{seq}

Algorithm 1 BuildProduct (Sampling Products)

Building a Product

In buildProduct (Algorithm 1), first we sample $\mathrm{nfac}_{i}$ , the maximum number of factors in the product ( $P_{i}$ ). We then build factors sequentially. For each new factor, we sample a subset of variables in a factor. We pass on product-level constraints such as maximum degree in a product, maximum terms in a product, and maximum coefficient for a product as $\mathrm{rdegree},\mathrm{rterms}$ and $\mathrm{rcoeff}$ respectively; and call the sub-routine buildFactor (Algorithm 2 to create a factor. After a factor is sampled, the constraints $\mathrm{rdegree},\mathrm{rterms}$ and $\mathrm{rcoeff}$ are updated. buildFactor is used to create at most $\mathrm{nfac}_{i}$ factors, that all abide by the above constraints and stops if the limit of maximum degree in the product is reached. The terms in a factor are arranged in a lexicographical order. Since, this sequential generation of factors may induce a certain pattern of decreasing degrees and coefficients, we shuffle the factors to create the final product.

Simplification Steps and Full Proof

For both coarse and finer configurations, we build the proof steps in the following way: 1) first we do a sequence of facsteps where terms get collected within a factor (such as $2x+3x$ to $5x$ , $x^{1}$ and $1x$ becomes $x$ ); 2) then a sequence of mulsteps are performed where simplified factors are multiplied out; and 3) lastly, in sumstep simplified products are added together. As mentioned before, the sequence of simplification steps till the endpoint constitute a full proof.

The polynomial sampling algorithms buildProduct and buildFactor are provided in Algorithms 1 and 2 respectively.

Input:

x_{\mathcal{P}_{i}},rdegree,rterms,rcoeff

Constraints : num_vars_fac, max_coeff_fac, max_terms_fac, max_degree_fac

Output: A factor

f_{j}

, Number of terms

nterms_{j}

1 Sample

nvar

\in\{1,\ldots,\texttt{num\_vars\_fac}\}

cvars

= Sample

nvar

variables from

x_{\mathcal{P}_{i}}

// Variable set for this factor

2 Sample

nterms\in\{1,\ldots,min(\texttt{max\_terms\_fac},rterms)\}

// # Terms for this factor

3 Sample

\{d_{k}\}^{nterms}_{k=1}

, s.t.

d_{k}\in\{0,\ldots,min(\texttt{max\_degree\_fac},rdegree)\}

// Term degrees: degree 0 allows for constant terms

4 Sample

\{c_{k}\}^{nterms}_{k=1}

, s.t.

c_{k}\in\{1,\ldots,min(\texttt{max\_coeff\_fac},rcoeff)\}

// Term coefficients

5 for $k\leftarrow 1$ to $nterms$ do

6 selects

d[k]

variables from

cvars

with replacement

// E.g. if

d[k]=4,cvars=[x_{1},x_{2}]\text{. May sample }[x_{1},x_{2},x_{1},x_{1}]

Convert the selected

d[k]

variables to a term

t_{k}=c_{k}*x_{1}^{3}*x_{2}

8 end for

f_{j}=\sum_{k=1}^{nterms}t_{k}

10 return

f_{j}

Algorithm 2 BuildFactor (Sampling A Factor)

Appendix B Result Table: FAQs

Q: How do you generate a proof?: We feed each input step to the model, and get the next step using greedy decoding.
Q: When is a proof marked correct?: When all the intermediate steps generated and the final simplified polynomial match the ground truth. These results are recorded in the full proof major column.
Q: When is the step-wise major column then?: We also tried using beam-5 for decoding. The results for that are summarised, per step, in the step-wise major column.
Q: Why are ”full proof $>$ stepwise accuracy” and ”step-wise $>$ top-1 accuracy” different?: As explained earlier, full proof results are with greedy decoding, and step wise are with beam-5. Since the top output for beam-5 can be different from greedy decoding, we have these two different metrics. As one can see, greedy decoding usually does better than beam search.
Q: How are ”step-wise $>$ top-1 accuracy” and ”step-wise $>$ beam-5 accuracy” different?: Both of those metrics are for beam-5 decoding. Top-1 accuracy measures how many times the top outputs in the beam matches the ground truth, whereas beam-5 accuracy measures how many times ANY output in the beam matches ground truth.
Q: What is calibration?: Calibration is used to quantify how confident a model of its predictions. If the top prediction of a model also has a very low probability, we can say that the model is not able to come up with a good answer. We use the log ratio of model probability of top 2 outputs with beam-5 decoding to measure how sure model is of its top prediction. We call this ratio the confidence score of the model.
Q: Why use the log ratio of top 2?: Note that in our configuration, only one output is correct for each input. In our experiments we found that when the model found the correct output, it gave a very high score to it, and low scores for rest of the beam predictions. Whereas, when it gave incorrect answers, the top few beam outputs had similar (and low) scores. So, we use the log ratio measure for confidence.
Q: How do you know when the confidence score is enough to mark an output as sure?: We used AUC-ROC and AUPRC to calibrate a threshold for the model calibration scores. An output with score above the threshold is marked ”sure” and below that is marked ”unsure”.
Q: What are the subcolumns under the calibration major column?: Sure rate is the %age of outputs marked as correct. Precision measures how many sure outputs are correct. Recall measures how many of the correct outputs were marked as sure. The F-1 score is calculated using that precision and recall.
Q: Finally, what are #EE and #Train?: To compare the sample efficiency with the baseline models, we note the number of examples the models see before convergence. Note that for the ENDPOINT model, an example consists of an initial polynomial and final simplified version pair. Whereas for the stepwise models, one example consists of one proof step (NOT the whole proof).

Appendix C Problem Space Size Estimation

For smaller configurations, it is probable that eventually all simplified polynomials would be included in the training data. To account for this, we estimate the problem space size for each configuration and report the size of training data for comparison. We randomly generate two sets of starting polynomials say $S_{1}$ and $S_{2}$ , and check for collisions among them. Assuming the actual size is $X$ and uniform distribution over all starting polynomials, the expected number of collisions would be $R=\frac{S_{1}*S_{2}}{X}$ . Using the above method, we estimate the number of un-simplified polynomials and the number of unique endpoints, and report in Table 4. We observe that compared to the number of training examples it took for the models to converge in both End-point and Step-wise prediction tasks, the space of possible equations is often 25 (or more) times higher.

Sampled polynomials are not uniformly distributed as we assign an equal probability while sampling polynomials of lower and higher degrees, say 3 and 6; whereas there are more polynomials of degree 6 than degree 3. For non-uniform distributions, we expect more collisions as higher probability equations are more likely to occur in both $S_{1}$ and $S_{2}$ . Moreover, since many equations may map to the same endpoint, such collisions for endpoints are even more likely. Thus, our empirical estimate of the population size provides a lower bound on the true value.

Config

NVAR = 1

NVAR = 2

Equation

Size Estimate

Endpoint

Size Estimate

Equation

Size Estimate

Endpoint

Size Estimate

SMALL COEFF

104M

8.24M

184M

27.4M

MEDIUM COEFF

179M

16.3M

325M

42.4M

LARGE COEFF

289M

32M

507M

68.8M

NO BACKTRACK

324M

54.9M

538M

104M

MEDIUM DEG

459M

67.4M

902M

144M

MEDIUM TERMS

866M

31.5M

1.73B

801M

Table 4: Size Estimates for the problem space, after generating sets of size 5M.

Appendix D Input Representation (Additional Results)

For Transformers-S setting, we present the complete results for 1 variable in Table 5. For 2 variables, the Coarse configuration results are in Table 6 and FINE configuration in Table 7. The errors made by the models for 1 Variable and 2 Variable settings are presented in Tables 8 and 9 respectively.

Polynomial Config

Proof/Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

Small Coeff

Coarse/Infix

3.6M

98.83

88.13

89.67

83.2

100

94.4

0.97

Fine/Infix

4.8M

98.9

99.79

94.46

92.38

100

97.8

0.99

Coarse/Prefix

5.2M

97.8

3.2M

95.3

98.97

87.83

89.37

83.03

100

94.54

0.97

Fine/Prefix

5.2M

97.8

4.4M

96.9

99.4

95.1

95.83

93.13

99.96

97.9

0.99

Medium Coeff

Coarse/Infix

4.1M

91.2

4.3M

92.8

98.24

88.97

91.67

84.3

100

94.75

0.97

Fine/Infix

4.1M

91.2

2.9M

90.3

97.99

86.1

87.68

81.14

100

94.24

0.97

Coarse/Prefix

6.1M

95.87

5.3M

93.6

98.58

86.6

88.47

82.83

99.88

95.54

0.98

Fine/Prefix

6.1M

95.87

4.5M

91.7

98.37

95.1

96.43

91.27

100

95.97

0.98

Large Coeff

Coarse/Infix

4.8M

83.73

3.4M

82.1

95.97

92.34

94.22

87.2

99.98

94.41

0.97

Fine/Infix

4.8M

83.73

3.4M

82.5

96.44

92.32

94.26

87.5

99.98

94.76

0.97

Coarse/Prefix

6.5M

85.87

3.5M

83.5

96.25

80.6

83.3

99.91

92.97

0.96

Fine/Prefix

6.5M

85.87

3.2M

96.32

79.13

80.63

75.57

99.96

95.45

0.98

No Backtrack

Coarse/Infix

5.9M

80.1

3.8M

75.6

94.62

72.74

77.28

61.8

99.9

84.88

0.92

Fine/Infix

5.9M

80.1

74.5

94.76

88.34

90.9

79.44

99.92

89.86

0.95

Coarse/Prefix

6.6M

78.87

5.6M

79.7

95.38

81.93

85.57

72.2

100

88.12

0.94

Fine/Prefix

6.6M

78.87

4.2M

74.7

95.23

82.03

72.23

100

91.43

0.96

Medium Degree

Coarse/Infix

9.2M

96.4

4.9M

92.8

98.26

87.18

88.96

81.12

100

93.05

0.96

Fine/Infix

9.2M

96.4

3.3M

83.4

96.12

88.26

90.44

83.04

99.95

94.04

0.97

Coarse/Prefix

94.33

4.3M

87.7

96.82

77.33

82.13

69.33

100

89.66

0.95

Fine/Prefix

94.33

5.9M

90.6

97.92

82.2

83.7

77.27

99.96

93.96

0.97

Medium Terms

Coarse/Infix

4.6M

81.9

2.3M

72.7

93.99

79.44

82.22

68.22

99.97

85.85

0.92

Fine/Infix

4.6M

81.9

2.8M

75.1

95.42

86.2

88.48

76.72

99.92

88.93

0.94

Coarse/Prefix

89.8

4.3M

76.3

95.78

87.8

91.17

81.3

100

92.6

0.96

Fine/Prefix

89.8

3.2M

74.8

95.55

87.67

90.4

76.9

100

87.72

0.93

Table 5: Results for 1 variable in the Coarse and Fine configuration for both Infix and Prefix representation.

Polynomial Config

Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

Small Coeff

Infix

4.3M

94.7

3.7M

87.9

97.01

88.9

81.07

100

91.19

0.95

Prefix

4.5M

93.93

5.3M

91.2

98.08

83.83

86.7

77.57

100

92.52

0.96

Medium Coeff

Infix

95.3

5.3M

88.5

97.35

90.98

93.7

84.64

99.98

93.01

0.96

Prefix

5.2M

92.77

4.8M

84.5

96.03

89.57

92.93

81.27

99.96

90.7

0.95

Large Coeff

Infix

91.8

3.8M

80.4

95.18

90.44

93.14

82.74

99.93

91.42

0.95

Prefix

6.1M

86.6

5.4M

83.7

96.23

92.23

94.57

86.03

100

93.28

0.97

No Backtrack

Infix

8.6M

83.8

72.7

93.13

75.48

78.74

64.4

100

85.32

0.92

Prefix

7.1M

79.2

4.3M

63.2

89.87

72.07

76.43

59.4

99.94

82.38

0.9

Medium Degree

Infix

4.9M

87.9

5.1M

80.5

95.13

90.3

92.53

80.63

100

89.29

0.94

Prefix

5.2M

83.73

6.1M

83.4

96.41

92.07

94.43

83.13

99.96

90.26

0.95

Medium Terms

Infix

8.5M

3.8M

92.03

80.5

83.66

66.62

100

82.76

0.91

Prefix

6.6M

87.07

6.3M

67.8

93.58

89.7

91.57

80.33

99.96

89.52

0.94

Table 6: Results for 2 variables for the Coarse configuration for both Infix and prefix representations.

Polynomial Config

Proof/Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

SMALL COEFF

Infix/Fine

4.3M

94.7

4.6M

88.1

97.19

90.7

92.2

83.47

100

92.02

0.96

Prefix/Fine

4.5M

93.93

5.4M

90.3

97.83

94.63

96.2

87.9

99.96

92.85

0.96

MEDIUM COEFF

Infix/Fine

95.3

4.4M

82.2

96.25

94.28

95.76

86.24

100

91.47

0.96

Prefix/Fine

5.2M

92.77

2.9M

72.4

93.6

91.53

94.33

81.97

100

89.55

0.94

LARGE COEFF

Infix/Fine

91.8

3.2M

93.85

77.94

82.2

99.9

80.75

0.89

Prefix/Fine

6.1M

86.6

4.7M

78.6

95.6

91.93

93.47

83.87

100

91.23

0.95

NO BACKTRACK

Infix/Fine

8.6M

83.8

5.8M

72.5

94.64

81.54

84.82

72.34

100

88.72

0.94

Prefix/Fine

7.1M

79.2

4.1M

60.7

90.48

81.73

85.67

70.2

99.91

85.81

0.92

MEDIUM DEG

Infix/Fine

4.9M

87.9

3.6M

73.5

94.21

89.78

92.46

77.22

100

86.01

0.92

Prefix/Fine

5.2M

83.73

4.6M

73.6

94.57

86.5

89.4

76.93

100

88.94

0.94

MEDIUM TERMS

Infix/Fine

8.5M

4.8M

92.98

79.04

81.86

66.92

99.88

84.56

0.92

Prefix/Fine

6.6M

87.07

4.5M

62.9

92.74

86.4

89.07

73.67

100

85.26

0.92

Table 7: Results for fine configuration for 2 Variables for Infix and Prefix representation (No curriculum, No annotation).

Config

Proof Type

Full Proof

Error Percentage

Full Proof

Accuracy

Greedy

Stepwise

Accuracy

First

FacStep

Total

FacStep

First

MulStep

Total

MulStep

First

SumStep

Total

SumStep

SMALL COEFF

Coarse/Infix

98.83

9.43

84.91

5.66

Fine/Infix

98.9

99.79

100

Coarse/Prefix

95.3

98.97

4.26

4.08

72.34

71.43

23.4

24.49

Fine/Prefix

96.9

99.4

9.68

77.42

12.9

MEDIUM COEFF

Coarse/Infix

92.8

98.24

1.39

1.25

95.83

92.5

2.78

6.25

Fine/Infix

90.3

97.99

11.34

11.32

85.57

84.91

3.09

3.77

Coarse/Prefix

93.6

98.58

3.12

2.94

95.31

95.59

1.56

1.47

Fine/Prefix

91.7

98.37

2.41

2.33

96.39

96.51

1.2

1.16

LARGE COEFF

Coarse/Infix

82.1

95.97

3.35

3.02

93.85

91.46

2.79

5.53

Fine/Infix

82.5

96.44

2.86

2.56

93.71

90.77

3.43

6.67

Coarse/Prefix

83.5

96.25

4.24

3.78

93.94

92.97

1.82

3.24

Fine/Prefix

96.32

3.33

2.97

90.56

86.63

6.11

10.4

NO BACKTRACK

Coarse/Infix

75.6

94.62

2.87

3.13

93.44

86.83

3.69

10.03

Fine/Infix

74.5

94.76

3.14

3.56

93.33

78.63

3.53

17.81

Coarse/Prefix

79.7

95.38

7.39

6.57

89.16

83.94

3.45

9.49

Fine/Prefix

74.7

95.23

2.37

2.41

96.44

89.16

1.19

8.43

MEDIUM DEG

Coarse/Infix

92.8

98.26

5.56

6.02

86.11

79.52

8.33

14.46

Fine/Infix

83.4

96.12

6.63

5.56

89.76

83.33

3.61

11.11

Coarse/Prefix

87.7

96.82

4.07

3.57

93.5

90.71

2.44

5.71

Fine/Prefix

90.6

97.92

8.51

7.55

89.36

87.74

2.13

4.72

MEDIUM TERMS

Coarse/Infix

72.7

93.99

25.64

24.18

73.26

69.72

1.1

6.1

Fine/Infix

75.1

95.42

21.29

20.51

75.1

69.94

3.61

9.55

Coarse/Prefix

76.3

95.78

7.59

8.71

88.61

84.67

3.8

6.62

Fine/Prefix

74.8

95.55

14.68

16.76

79.76

74.28

5.56

8.96

Table 8: Errors for 1 variable in the Coarse and Fine configuration for both Infix and Prefix input representation. (No curriculum, No annotation).

Config

Proof Type

Full Proof

Error Percentage

Full Proof

Accuracy

Greedy

Stepwise

Accuracy

First

FacStep

Total

FacStep

First

MulStep

Total

MulStep

First

SumStep

Total

SumStep

SMALL COEFF

Coarse/Infix

87.9

97.01

5.79

4.9

88.43

79.72

5.79

15.38

Fine/Infix

88.1

97.19

8.4

7.98

75.63

68.1

15.97

23.93

Coarse/Prefix

91.2

98.08

1.14

1.03

88.64

84.54

10.23

14.43

Fine/Prefix

90.3

97.83

8.25

6.35

80.41

73.02

11.34

20.63

MEDIUM COEFF

Coarse/Infix

88.5

97.35

4.35

3.73

83.48

76.87

12.17

19.4

Fine/Infix

82.2

96.25

2.25

1.83

76.4

68.81

21.35

29.36

Coarse/Prefix

84.5

96.03

3.87

3.68

88.39

81.58

7.74

14.74

Fine/Prefix

72.4

93.6

12.68

9.95

76.09

67.74

11.23

22.31

LARGE COEFF

Coarse/Infix

80.4

95.18

6.12

4.84

82.65

75.81

11.22

19.35

Fine/Infix

93.85

11.85

8.74

70.74

62.3

17.41

28.96

Coarse/Prefix

83.7

96.23

4.29

3.61

87.12

82.99

8.59

13.4

Fine/Prefix

78.6

95.6

5.14

4.2

81.31

74.43

13.55

21.37

NO BACKTRACK

Coarse/Infix

72.7

93.13

4.4

3.15

87.55

75.79

8.06

21.07

Fine/Infix

72.5

94.64

3.27

2.54

85.09

73.79

11.64

23.66

Coarse/Prefix

63.2

89.87

3.26

2.24

91.3

78.73

5.43

19.03

Fine/Prefix

60.7

90.48

2.29

1.58

89.31

72.64

8.4

25.79

MEDIUM DEG

Coarse/Infix

80.5

95.13

6.67

5.44

81.54

71.97

11.79

22.59

Fine/Infix

73.5

94.21

7.17

6.55

68.3

57.83

24.53

35.61

Coarse/Prefix

83.4

96.41

4.82

4.19

81.33

75.39

13.86

20.42

Fine/Prefix

73.6

94.57

7.58

6.38

75.38

67.48

17.05

26.14

MEDIUM TERMS

Coarse/Infix

92.03

19.05

72.5

66.5

2.5

14.45

Fine/Infix

92.98

13.61

8.62

79.44

69.59

6.94

21.79

Coarse/Prefix

67.8

93.58

10.25

7.69

87.89

80.98

1.86

11.32

Fine/Prefix

62.9

92.74

9.16

5.97

85.71

74.37

5.12

19.65

Table 9: Errors for 2 variables in the Coarse and Fine configuration for both Infix and Prefix input representation. (No curriculum, No annotation).

Appendix E Symbolic Proof and Hyperparameter Tuning (Additional Results)

We present the step-wise error analysis of the best models from hyper-parameter tuning experiment in Table 10. For comparison, we also include the errors made by the best models in the symbolic calculator setting for corresponding configurations.

Hyperparameter Tuning Observations

$\bullet$ For learning rates (lr), we first started with the range $0.01,0.005,0.0005$ and experimented with few configurations, Small Coeff and Medium Terms. lr greater than $0.0005$ resulted in zero validation scores. Later, in our experiments, we settled with the range $0.0001,0.0005,0.00001$ . Similarly, lr $10^{-5}$ took a long time to converge and validation score started oscillating. Dropout choices did not show any particular advantage over the other. For most cases, configuration with batch size 64 showed dominant results.
$\bullet$ Apart from a limited number of settings, coarse proof granularity resulted in the best model. This is expected as fine proof creates long proofs.
$\bullet$ Contrary to the observation made in the Transformers-S configuration, infix representation consistently improved over prefix. This is observed for the symbolic calculator setting as well.

Model

Info

Full Proof

Acc

Stepwise

Acc

First

FacSimp

Total

FacSimp

First

MulSimp

Total

MulSimp

First

SumSimp

Total

SumSimp

Large Coeff

h/prefix

Emb: 512,

d/o: 0.5, bs: 64

Tr-L, Coarse

78.8

95.0

8.96

9.4

90.09

88.03

0.94

2.56

h/infix

Emb: 256,

d/o: 0, bs: 32

Tr-L, Fine

84.6

96.79

5.19

5.68

94.81

92.05

0.00

2.27

cal/prefix

Emb: 512,

d/o: 0.5, bs: 64

Tr-L, Coarse

94.6

98.98

18.52

18.97

79.63

79.31

1.85

1.72

cal/infix

Emb: 512,

d/o: 0.5, bs: 64

Tr-L, Coarse

95.7

99.16

4.65

6.25

83.72

79.17

11.63

14.58

No Backtrack

h/prefix

Emb: 256,

d/o: 0, bs: 64

Tr-L, Coarse

74.1

93.69

1.93

1.52

96.91

87.54

1.16

10.94

h/infix

Emb: 512,

d/o: 0, bs: 64

Tr-L, Fine

82.6

96.85

5.17

6.85

91.95

84.47

2.87

8.68

cal/prefix

Emb: 512,

d/o: 0.5, bs: 32

Tr-L, Fine

83.5

97.25

11.52

13.24

86.67

81.74

1.82

5.02

cal/infix**

Emb: 512,

d/o: 0.5, bs: 64

Tr-S, Fine

87.2

97.94

7.81

10.98

86.72

80.49

5.47

8.54

Medium Degree

h/prefix

Emb: 256,

d/o: 0.5, bs: 32

Tr-S, Coarse

95.49

11.11

10.05

85.56

84.42

3.33

5.53

h/infix

Emb: 512,

d/o: 0, bs: 64

Tr-L, Fine

90.8

97.81

2.17

1.79

90.22

83.93

7.61

14.29

cal/prefix

Emb: 512,

d/o: 0.5, bs: 64

Tr-L, Coarse

90.5

97.89

10.53

9.65

83.16

78.95

6.32

11.40

cal/infix

Emb: 512,

d/o: 0.5, bs: 64

Tr-L, Coarse

90.5

98.08

5.26

4.81

91.58

87.5

3.16

7.69

Medium Terms

h/prefix

Emb: 512,

d/o: 0, bs: 128

Tr-L, Fine

76.5

95.59

20.0

22.45

71.49

66.47

8.51

11.08

h/infix

Emb: 512,

d/o: 0, bs: 32

Tr-S, Fine

79.5

96.49

18.54

17.58

80.49

77.29

0.98

5.13

cal/prefix

Emb: 512,

d/o: 0.5, bs: 32

Tr-L, Coarse

88.3

98.35

9.4

10.85

88.03

86.82

2.56

2.33

cal/infix

Emb: 256,

d/o: 0.5, bs: 128

Tr-L, Fine

89.5

98.45

20.0

25.00

74.29

66.91

5.71

8.09

Table 10: Error analysis for 1 variable, hyperparmeter-tuning and symbolic calculator experiments. h/prefix and h/infix rows denote the error analysis of models with best validation scores after the hyper-parameter search for prefix and infix input representation respectively. Similarly for cal/prefix and cal/infix provides the error analysis for the calculator setting for prefix and infix for the best models after hyperparameter search. For each row, we also show the best model and hyper-parameter configuration. For all configurations, learning rate

10^{-4}

produced the best validation scores. **Some runs for the cal/infix No Backtrack has not finished. We will update the final numbers during publication.

Appendix F Annotated Proof

In each step, simplification is performed over a sub-expression of the polynomial. To check explicitly, if the system can locate the sub-expression and find the type of simplification step, we devise the annotated proof setting. For each simplification step, we add an intermediate step, in which the model annotates the part of polynomial to operate on. For example, the starting input sequence is “ $\text{MARK}~{}\$~{}(5*x_{1}^{2}+x_{1}*x_{2})*(3*x_{1})*(2)$ ”; and the corresponding expected output sequence is “ $\text{MUL}~{}\$~{}\#(5*x_{1}^{2}+x_{1}*x_{2})*(3*x_{1})\#*(2)$ ”. Each sequence has two parts: 1) the step index to perform (Mark, Mul, Fac, Sum), and 2) the polynomial expression. For Mark step, a marker token (#) is used to annotate the candidate sub-expression to be simplified next.

We experiment only with infix representation. The results for 1 variable and 2 variables are in Table 11 and 12. The errors per step type are shown in Tables 13 and 14. Compared to non-annotated setting, while the step-wise accuracy is similar, the proof accuracy suffers often by 7-10%. A reason for such decrease in accuracy is that length of the annotated proofs are twice as long as non-annotated. However, we note that the errors in Mark step are the lowest compared to other types of steps. This indicates that the models are able to learn the candidate sub-expression for simplification, and predict the next operation correctly.

Polynomial Config

Proof/Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

SMALL COEFF

Fine

2.4M

88.5

98.82

86.77

87.7

83.97

99.96

96.73

0.98

Coarse

3.7M

91.9

99.16

90.07

90.73

88.13

100

97.85

0.99

MEDIUM COEFF

Fine

4.1M

91.2

2.8M

78.6

97.66

92.67

93.63

88.23

100

95.22

0.98

Coarse

3.5M

84.2

98.29

94.83

95.53

92.4

99.96

97.4

0.99

LARGE COEFF

Fine

4.8M

83.73

3.6M

75.5

97.37

96.8

97.8

92.4

99.93

95.39

0.98

Coarse

4.6M

80.3

97.86

80.37

81.6

77.83

100

96.85

0.98

NO BACK TRACK

Fine

5.9M

80.1

4.1M

96.78

90.43

92.33

84.5

99.96

93.4

0.97

Coarse

3.6M

59.7

92.5

94.33

86.47

99.81

93.3

0.96

MEDIUM DEG

Fine

9.2M

96.4

3.7M

97.37

83.67

85.23

79.1

100

94.54

0.97

Coarse

3.4M

78.7

97.38

93.2

94.37

88.2

100

94.64

0.97

MEDIUM TERMS

Fine

4.6M

81.9

3.6M

70.4

97.48

91.5

92.2

86.87

100

94.94

0.97

Coarse

3.3M

66.2

96.34

88.9

90.27

83.17

99.84

93.4

0.97

Table 11: Results for fine and coarse configurations for 1 Variable for annotated proofs

Polynomial Config

Proof/Input Format

Endpoint

#Train

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

#EE

Endpoint

Acc

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

SMALL COEFF

Fine

4.3M

94.7

3.6M

82.3

97.93

86.47

87.5

81.83

100

94.64

0.97

Coarse

5.1M

98.31

93.5

94.03

90.27

100

96.54

0.98

MEDIUM COEFF

Fine

95.3

5.4M

78.8

97.78

93.8

94.5

90.2

99.93

96.09

0.98

Coarse

80.1

97.69

89.37

90.27

86.77

99.96

97.05

0.98

LARGE COEFF

Fine

91.8

4.1M

70.1

96.59

84.8

86.63

77.77

99.83

91.55

0.96

Coarse

73.2

96.66

92.77

93.8

87.23

100

94.04

0.97

NO BACKTRACK

Fine

8.6M

83.8

3.5M

46.5

92.93

84.9

87.67

74.5

99.96

87.71

0.93

Coarse

6.7M

65.5

95.7

67.8

69.37

63.3

99.79

93.17

0.96

MEDIUM DEG

Fine

4.9M

87.9

3.9M

59.6

95.28

94.13

95.7

86.4

100

91.78

0.96

Coarse

4.1M

65.1

95.61

85.43

87.43

78.2

99.96

91.49

0.96

MEDIUM TERMS

Fine

8.5M

4.8M

56.9

95.7

92.4

93.83

85.77

99.88

92.71

0.96

Coarse

4.2M

52.8

94.57

85.93

75.93

99.82

90.24

0.95

Table 12: Results for fine and coarse configurations for 2 Variables for annotated proofs (No curriculum).

Config

Proof Type

Full Proof

Error Percentage

Full Proof

Accuracy

Greedy

Stepwise

Accuracy

First

FacStep

Total

FacStep

First

MulStep

Total

MulStep

First

SumStep

Total

SumStep

First

MarkStep

Total

MarkStep

SMALL COEFF

Fine

88.5

98.82

3.48

2.99

89.57

83.58

6.09

11.19

0.87

2.24

Coarse

91.9

99.16

1.23

1.19

98.77

96.43

1.19

MEDIUM COEFF

Fine

78.6

97.66

18.69

15.19

74.77

74.44

3.27

6.67

3.27

3.7

Coarse

84.2

98.29

4.43

4.65

84.81

84.88

6.33

5.81

4.43

4.65

LARGE COEFF

Fine

75.5

97.37

11.43

9.21

72.65

66.35

10.61

19.68

5.31

4.76

Coarse

80.3

97.86

5.58

5.86

90.86

87.39

1.02

4.5

2.54

2.25

NO BACKTRACK

Fine

96.78

7.19

6.46

86.56

78.54

5.62

12.71

0.62

2.29

Coarse

59.7

6.2

5.25

88.09

76.88

3.72

15.41

1.99

2.45

MEDIUM DEG

Fine

97.37

11.67

10.85

82.5

80.34

3.75

6.78

2.08

2.03

Coarse

78.7

97.38

6.1

5.84

86.38

81.32

4.69

9.73

2.82

3.11

MEDIUM TERMS

Fine

70.4

97.48

16.89

16.27

69.14

3.72

8.85

4.39

5.74

Coarse

66.2

96.34

25.44

25.28

68.05

63.48

2.37

5.81

4.14

5.43

Table 13: Errors for fine and coarse configurations for 1 Variable for annotated proofs (No curriculum).

Config

Proof Type

Full Proof

Error Percentage

Full Proof

Accuracy

Greedy

Stepwise

Accuracy

First

FacStep

Total

FacStep

First

MulStep

Total

MulStep

First

SumStep

Total

SumStep

First

MarkStep

Total

MarkStep

SMALL COEFF

Fine

82.3

97.93

4.52

3.07

86.44

68.97

7.34

24.14

1.69

3.83

Coarse

98.31

1.68

88.67

78.21

18.44

1.33

1.68

MEDIUM COEFF

Fine

78.8

97.78

8.96

6.79

80.19

68.21

9.43

22.5

1.42

2.5

Coarse

80.1

97.69

9.05

7.79

87.94

80.33

3.02

10.66

1.23

LARGE COEFF

Fine

70.1

96.59

13.38

69.9

59.32

13.38

25.45

3.34

5.23

Coarse

73.2

96.66

10.45

7.84

79.85

70.87

7.84

18.49

1.87

2.8

NO BACKTRACK

Fine

46.5

92.93

9.16

5.15

74.21

57.9

14.58

33.69

2.06

3.25

Coarse

65.5

95.7

3.19

2.61

90.14

77.31

5.22

18.27

1.45

1.81

MEDIUM DEG

Fine

59.6

95.28

7.43

5.48

72.03

57.1

15.35

32.9

5.2

4.52

Coarse

65.1

95.61

6.88

5.26

78.51

67.79

11.46

24.63

3.15

2.32

MEDIUM TERMS

Fine

56.9

95.7

21.58

13.3

67.29

57.72

8.58

23.96

2.55

5.02

Coarse

52.8

94.57

23.94

15.6

68.22

62.77

3.6

16.67

4.24

4.96

Table 14: Errors for fine and coarse configurations for 2 Variable for annotated proofs (No curriculum).

Appendix G Out-of-Distribution Evaluation

We present the results for Out-of-Distribution evaluation here. Table 15 contains results for best 2 variable models (Prefix/Coarse) tested on 1 Variable setting.
Table 16 contains results for best 1 variable models (Prefix/Coarse) tested on Small, medium and Large coefficient setting. As expected, the Small and medium models perform much worse when tested on higher coefficients.
We also evaluated the best 1 variable models (Prefix/Coarse) on medium degree and terms settings, to check generalization with respect to # terms and degree of polynomial. Table 17 contains results for the same. The Medium Coeff model is not able to generalize to more terms or polynomials of higher degree.

Config

Train/Test= 2 Var/1 Var

Train/Test= 1 Var/1 Var

Train/Test= 2 Var/2 Var

Full

Proof Acc.

Greedy

Stepwise Acc.

Full

Proof Acc.

Greedy

Stepwise Acc.

Full

Proof Acc.

Greedy

Stepwise Acc.

SMALL

COEFF

95.34

99.12

95.3

98.97

91.2

98.08

MEDIUM

COEFF

87.4

97.11

93.6

98.58

84.5

96.03

LARGE

COEFF

89.4

97.13

83.5

96.25

83.7

96.23

NO BACK

TRACK

84.2

98.29

79.7

95.38

63.2

89.87

MEDIUM

DEG

87.7

97.83

87.7

96.82

83.4

96.41

MEDIUM

TERMS

78.5

96.16

76.3

95.78

67.8

93.58

Table 15: Results for OOD Testing. NVAR = 2 Coarse/Prefix models tested on corresponding NVAR = 1 setting (No curriculum, No annotation).

Train Config

Test Config

SMALL

COEFF

MEDIUM

COEFF

LARGE

COEFF

Full

Proof

Acc.

Greedy

Stepwise

Acc.

Full

Proof

Acc.

Greedy

Stepwise

Acc.

Full

Proof

Acc.

Greedy

Stepwise

Acc.

SMALL

COEFF

95.3

98.97

33.4

69.05

68.02

MEDIUM

COEFF

96.6

99.29

93.6

98.58

33.6

68.96

LARGE

COEFF

95.8

99.1

94.4

98.64

83.5

96.25

Table 16: OOD Testing: Prefix/Coarse 1 Variable Models tested on various coefficient limit configurations (Small, Medium and Coarse). (No curriculum, No annotation).

Train Config

Test Config

MEDIUM

COEFF

MEDIUM

DEG

MEDIUM

TERMS

Full

Proof

Acc.

Greedy

Stepwise

Acc.

Full

Proof

Acc.

Greedy

Stepwise

Acc.

Full

Proof

Acc.

Greedy

Stepwise

Acc.

MEDIUM

COEFF

93.6

98.58

20.8

47.77

26.1

54.65

MEDIUM

DEG

94.8

98.93

87.7

96.82

25.5

54.39

MEDIUM

TERMS

92.7

96.87

18.6

46.97

76.3

95.78

Table 17: OOD Testing: Prefix/Coarse 1 Variable Models tested on various #term and degree configurations (Medium degree and Medium terms). (No curriculum, No annotation).

Appendix H Curriculum Learning

Learning the simplification steps should entail learning the sub-tasks, such as addition and multiplication (of numeric coefficients and symbolic variables); where multiplying variables precludes learning to add exponents of similar variables. As these sub-tasks are well-defined and dependencies among them are clear, we explore different types of curriculums based on the Mastering-Rate-based (MR) curriculum learning algorithm proposed in Willems et al. (2020). Authors in Willems et al. (2020) define curriculum learning by 1) a curriculum i.e. a set of tasks $\mathcal{C}=\{c_{1},\ldots,c_{n}\}$ , where a task is set of examples of similar type with a sampling distribution, and 2) a program which for each training step defines the tasks to train the learner given its learning state and the curriculum. Formally, the program $d:\mathbb{N}\rightarrow\mathcal{D}^{\mathcal{C}}$ , is a sequence of distributions over $\mathcal{C}$ . The authors estimate the program function through an attention function which defines attention over the tasks at a time-step, and an attention-to-distribution converter which converts the attention to a distribution over $\mathcal{C}$ . Authors observe that other algorithms (Matiisen et al., 2019; Graves et al., 2017) are special cases of the above setting with different choices for program.

To learn on tasks that are learnable but not learnt yet, authors define an ordered curriculum $\mathcal{O}^{\mathcal{C}}$ which is a directed graph over tasks in $\mathcal{C}$ . An edge from A to B indicates that learning task A before B is preferable. For supervised learners, the learnability for each task depends on mastering rate ( $\mathcal{M}_{c}(t)$ ) computed from the normalized mean accuracy for that task at time-step $t$ . At each time-step, the MR algorithm computes attention over a task ( $a_{c}(t)$ ) from mastering rates of its ancestors and successors. During training to sample batches, a hyperparameter $N_{b}$ for the curriculum determines the number of batches to be considered at a step, before re-computing the attention over tasks. Using the program $d$ , we first sample $N_{b}*b$ examples from tasks in $\mathcal{C}$ . The model is then trained on randomly sampled $N_{b}$ minibatches are sampled updating the mastering rates.

For polynomial simplification for 1 variable, we define the following tasks Add, Mul2, Mul3, Scoeff and Mixed. For Add, only one factor per product is allowed, so there is no multiplication. For Mul2 and Mul3 only 1 product is allowed with maximum two factors and three factors respectively. Scoeff points to the Small Coeff configuration and Mixed is the final variable size configuration of the target variable configuration. We define the following curriculums:

•

C: {(Add, Mul3), (Mul3, Mixed), (Add, Mixed)}.
•

C2: {(Add, Mul2), (Mul2, Mul3), (Mul3, Mixed), (Add, Mixed)}.
•

C4: {(Add, Mul2), (Mul2, Mul3), (Mul3, Scoeff), (Add, Scoeff) (Scoeff, Mixed)}.

For all our experiments, we use the MR algorithm with gAmax Linreg A2D converter functions described in Willems et al. (2020). Model parameters and the training configurations remains the same as before⁸⁸8We use $N_{b}$ as 10. For other default parameters in CL, please check github.com/lcswillems/automatic-curriculum.. We show the results in Table 18 for Coarse configuration. As coefficient size grows from Small, Medium, Large to No Backtrack - the improvements in full proof accuracy steadily increase from $1\%$ to $10.84\%$ . For No Backtrack, the improvement in top-1 accuracy is by $20\%$ from a no curriculum setting. However, we observe for Medium Terms, there is a drop in accuracy for all curriculums and input representations. It is possible that, more carefully designed curriculums may improve the results. There is no conceivable pattern observed for infix or prefix representations. However, compared to learning without curriculum, the improvement observed for infix representation is larger than prefix.

Full Proof (Beam-1)

Step-wise (Beam-5)

Calibration (Beam-5)

Curri

culum

#Train

Full Proof

Accuracy

Stepwise

Accuracy

Top-1

Acc

Beam-5

Acc

Sure

Rate

2.8M

94.38

98.76

94.84

96.68

89.36

100

94.22

0.97

Infix

95.98

99.0

91.64

93.24

86.16

99.9

93.98

0.97

2.02M

94.26

98.65

77.76

80.46

70.62

99.94

90.77

0.95

Small Coeff

Prefix

2.29M

94.6

98.56

93.44

95.28

88.02

99.89

94.09

0.97

3.9M

95.44

99.02

94.86

96.44

91.18

100

96.12

0.98

Infix

93.86

98.59

88.22

90.24

84.68

99.91

95.90

0.98

3.7M

94.78

98.82

91.98

93.66

88.08

99.93

95.69

0.98

Medium Coeff

Prefix

4.4M

94.8

98.87

85.3

87.82

80.62

99.98

94.49

0.97

6.9M

91.26

97.92

96.4

98.06

90.24

99.89

93.51

0.97

Infix

7.6M

91.62

98.16

91.54

93.3

87.38

99.84

95.3

0.98

6.5M

92.2

98.31

85.38

87.78

81.42

99.95

95.32

0.98

Large Coeff

Prefix

6.97M

92.46

98.42

91.3

93.34

87.54

100.0

95.88

0.98

4.8M

86.44

97.27

93.68

95.46

88.72

99.98

94.68

0.97

Infix

5.1M

85.96

97.21

94.64

96.1

89.5

100

94.57

0.97

86.16

97.30

82.24

84.44

77.46

99.95

94.14

0.97

No Backtrack

Prefix

5.5M

86.48

97.45

92.6

94.3

87.78

99.95

94.75

0.97

3.5M

87.12

97.01

84.16

87.44

78.46

99.95

93.18

0.96

Infix

3.4M

94.12

98.65

90.62

81.984

86.66

99.93

95.56

0.98

5.35M

94.28

98.71

80.8

82.84

75.76

100

93.51

0.97

Medium Degree

Prefix

3.5M

92.38

98.30

83.7

85.48

78.94

99.92

94.24

0.97

4.4M

59.54

75.76

65.6

69.56

60.84

95.36

88.45

0.92

Infix

3.8M

56.94

76.72

69.84

73.44

60.76

97.5

84.82

0.91

2.8M

41.84

51.24

40.62

45.36

36.9

92.57

84.10

0.88

Medium Terms

Prefix

3.37M

49.02

65.41

58.56

64.64

45.44

96.83

75.14

0.85

Table 18: Curriculum Learning results for 1 variable for the Coarse configuration for both Infix and prefix representations.