GINN-LP: A Growing Interpretable Neural Network for Discovering Multivariate Laurent Polynomial Equations

Nisal Ranasinghe¹, Damith Senanayake¹, Sachith Seneviratne^1,2, Malin Premaratne³, Saman Halgamuge¹

Abstract

Traditional machine learning is generally treated as a black-box optimization problem and does not typically produce interpretable functions that connect inputs and outputs. However, the ability to discover such interpretable functions is desirable. In this work, we propose GINN-LP, an interpretable neural network to discover the form and coefficients of the underlying equation of a dataset, when the equation is assumed to take the form of a multivariate Laurent Polynomial. This is facilitated by a new type of interpretable neural network block, named the “power-term approximator block”, consisting of logarithmic and exponential activation functions. GINN-LP is end-to-end differentiable, making it possible to use backpropagation for training. We propose a neural network growth strategy that will enable finding the suitable number of terms in the Laurent polynomial that represents the data, along with sparsity regularization to promote the discovery of concise equations. To the best of our knowledge, this is the first model that can discover arbitrary multivariate Laurent polynomial terms without any prior information on the order. Our approach is first evaluated on a subset of data used in SRBench, a benchmark for symbolic regression. We first show that GINN-LP outperforms the state-of-the-art symbolic regression methods on datasets generated using 48 real-world equations in the form of multivariate Laurent polynomials. Next, we propose an ensemble method that combines our method with a high-performing symbolic regression method, enabling us to discover non-Laurent polynomial equations. We achieve state-of-the-art results in equation discovery, showing an absolute improvement of 7.1% over the best contender, by applying this ensemble method to 113 datasets within SRBench with known ground-truth equations.

Introduction

Although supervised machine learning (ML) can create arbitrary mappings between a large number of input variables and target variables, typically, there is no ability to discover an interpretable mathematical expression between the two variable spaces. However, in cases where the recovery of an underlying governing equation is required, this lack of interpretability is an obstacle. Interpretable machine learning has the potential to harness data to discover underlying equations that govern them.

In this work, we focus on the data-driven discovery of equations that take the form of multivariate Laurent polynomials (LP). LPs produce equations important in physics and real-world systems, with some examples shown in Table 1.

Phenomenon	Equation
Coulomb’s Law	$E_{f}=\frac{1}{4\pi\epsilon}\frac{q_{1}q_{2}}{r^{2}}$
Kinetic Energy	$K=\frac{m}{2}(u^{2}+v^{2}+w^{2})$
Potential Energy (Gravity)	$U=Gm_{1}m_{2}(\frac{1}{r_{2}}-\frac{1}{r_{1}})$

Table 1: Three example equations as multivariate LPs and their relevant phenomena

More formally, we aim to discover a multivariate LP that can accurately map features $\textbf{X}\in\mathbb{R}^{d}$ to a target $y\in\mathbb{R}$ using a dataset of paired samples for X and $y$ . An exhaustive search is tractable to solve this problem when the search space is small but quickly becomes intractable as the number of variables and possible terms increases, as demonstrated in the appendix¹¹1Appendix available in https://arxiv.org/abs/2312.10913.

In this work, we present GINN-LP, a Growing Interpretable Neural Network for discovering multivariate Laurent Polynomials, which can efficiently discover the form and coefficients of a multivariate LP equation that describes a given dataset, without explicitly searching through the equation space.

The main contributions of this work are as follows.

•

A new type of interpretable neural network (NN) block named the “power-term approximator” (PTA).
•

GINN-LP: An interpretable neural network architecture that uses multiple PTA blocks to discover multivariate LPs using observed data.
•

A neural network growth strategy allowing the automatic discovery of the number of terms in the underlying polynomial while reducing overfitting and training time.
•

A model training strategy that can effectively train the interpretable neural network and decide the best trade-off between accuracy and model simplicity.

Preliminaries

Here, we describe the preliminaries in formulating our rationale.

Definition: A multivariate LP has the form

P=\sum_{i=1}^{n}c_{i}\prod_{j=1}^{k}V_{j}^{p_{i}(j)}

(1)

where $c_{i}\in\mathbb{R}$ is the i-th term of the polynomial and $p_{i}(j)\in\mathbb{Z}$ is the power of the j-th variable in that term. The set of variables $V_{j}\in\{V_{1},...,V_{k}\}$ is the finite set of variables considered in the LP.

To the best of our knowledge, GINN-LP is the first method which can discover governing equations with arbitrary multivariate LP terms. The NN is end-to-end differentiable and therefore, can be trained using backpropagation. A neural network growth strategy is implemented to automatically discover the correct number of terms in the underlying equation. The model is trained on a dataset consisting of $(\mathbf{x_{i}},y_{i})$ pairs (input-output) where $0\leq i\leq N$ and N is the number of data points in the dataset.

In the following section, we review the existing literature on related methods. Then, we introduce GINN-LP, our interpretable NN that can discover multivariate LPs describing datasets. The results of the experiments conducted are then presented, and the final section provides a summary of this work, its limitations and some future research directions.

Related Work

Data-driven discovery of equations has been mostly dominated by genetic programming (GP) methods in the early literature, exemplified by the fact that symbolic regression (SR) was first introduced as a GP problem in (John R. Koza 1992). However, with the increased interest in explainable AI, a variety of methods have been developed in the recent past (La Cava et al. 2021).

Methods for symbolic regression: Symbolic regression is an area of research focused on discovering equations that describe given datasets, by efficiently searching through the space of possible equations.

A well-known method named AIFeynman discovers simplifying properties (e.g. symmetry, separability) of real-world equations using a NN to greatly reduce the search space till the search space becomes tractable (Udrescu and Tegmark 2020; Udrescu et al. 2020). Another class of SR methods use sparse regression such as least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) to recover a sparse coefficient matrix of a pre-defined equation structure, making the overall search more efficient (Brunton, Proctor, and Kutz 2015; Rudy et al. 2017; Brunton, Proctor, and Kutz 2016; Kaheman, Kutz, and Brunton 2020; Quade et al. 2018; Messenger and Bortz 2021). Sparse regression-based methods for equation recovery have been further improved using Bayesian methods (Zhang and Lin 2018; Jin et al. 2019), allowing the quantification of uncertainty.

Genetic programming (GP): Most early SR methods use genetic programming (Koza 1994; Dubčáková 2011; Schmidt and Lipson 2009). In these algorithms, a population of equations are evolved along multiple iterations till an acceptably accurate equation is found. GP based methods have been successfully used in areas such as microbial modelling (Vidanaarachchi et al. 2020). GP algorithms suffer from the combinatorial nature of the problems, i.e., the search space is vast and a large number of equally suitable equations may be recovered. Additionally, GP algorithms tend to be slower in convergence compared to other optimization methods such as gradient-based methods and therefore do not scale well.

Shallow Neural Networks: A class of equation learning neural networks, called equation learners (EQL) is presented in (Martius and Lampert 2016; Sahoo, Lampert, and Martius 2018). In EQL, the conventional activation functions are replaced by operators such as summation and trigonometric functions which are commonly found in real-world equations. This EQL network is integrated with deep learning architectures for end-to-end training in (Kim et al. 2021). PDE-NET 2.0 (Long, Lu, and Dong 2018) uses a symbolic neural network inspired by (Martius and Lampert 2016) to approximate the non-linear response function of the underlying partial differential equation (PDE) governing a system.

Deep Neural Networks: Some recent works in SR circumvent the uninterpretable nature of deep neural networks (DNN) by using a large model to search the smaller space of symbolic models. Deep symbolic regression (Petersen et al. 2021) proposes the use of a recurrent neural network (RNN) to output a distribution over symbolic expressions. The network is trained using a reinforcement learning method to learn the structure of the expression and the constants are learned using a non-linear optimization method such as the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) (Fletcher 2000).

In some recent deep learning methods, the model is pre-trained on a large collection of symbolic regression datasets with known equations (Valipour et al. 2021; Biggio et al. 2021; Kamienny et al. 2022). This contrasts with the majority of other methods, where the algorithms have to be trained from scratch for each SR dataset. Though they have been shown to be faster in equation discovery, the pre-training time could be quite large.

Performance estimation and benchmarking: The field of symbolic regression lacks a widely agreed-upon benchmarking platform or dataset. A recent work has introduced SRBench, an open-source, reproducible platform for benchmarking symbolic regression methods (La Cava et al. 2021), aiming to establish a standardized framework for evaluating SR methods. SRBench has incorporated over 130 datasets with ground truth equation forms to the Penn Machine Learning Benchmark (PMLB) (Olson et al. 2017), which can be used to evaluate SR methods. The main source of symbolic regression datasets used in SRBench is the Feynman dataset, which originates from the Feynman lecture series on physics (Feynman, Leighton, and Sands 2011). SRBench compares 14 different symbolic regression algorithms using these SR datasets as well as black-box datasets.

Method

Refer to caption — Figure 1: The architecture of the proposed interpretable NN block named the “PTA” block. This block can discover a single term in a multivariate LP. $x_{1},x_{2},...,x_{n}$ are the inputs to the block and $w_{1},w_{2},...,w_{n}$ are weights of the linear activated neuron.

Algorithm 1 GINN-LP training strategy

Input: Dataset of input x, output y pairs
Parameters: Training instance count ( $N=4$ ), model complexity weight ( $\alpha=10^{-6}$ ), $L_{1}$ regularization factor ( $\lambda_{1}=10^{-4}$ ), $L_{2}$ regularization factor ( $\lambda_{2}=10^{-4}$ ), maximum number of PTA blocks ( $P=4$ ), number of epochs per network growth stage ( $E$ ), rounding precision ( $\epsilon)$
Output: An equation that describes the relationship between inputs and outputs

1: for

k_{inst}\leftarrow 1

N

2: model

\leftarrow

InitNetwork(num_PTA=0)

MSE

\leftarrow

\infty

max\_SE_{M}\leftarrow\infty

5: for

k_{PTA}\leftarrow 1

P

6: GrowModel(model)

7: TrainModel(model,

L_{1}

\lambda_{1}

L_{2}

\lambda_{2}

, epochs=

E/2

)

8: TrainModel(model,

L_{1}

0

L_{2}

0

, epochs=

E/2

)

9: RoundExponents(model, round_prec=

\epsilon

)

10:

new\_MSE

= CalcMSE(model, validation_data)

11: if

new\_MSE>MSE*0.8

then

12: break

13: end if

14:

MSE=new\_MSE

15: end for

16:

eq\leftarrow

GetEQ(model)

17:

C_{M}\leftarrow

CalcComplexity(eq)

18:

SE_{M}\leftarrow MSE+\alpha*C_{M}

19: if

SE_{M}>max\_SE_{M}

then

20:

max\_SE_{M}\leftarrow SE_{M}

21:

best\_eq\leftarrow eq

22: end if

23: end for

24: return

best\_eq

In this work, we present GINN-LP, a novel growing interpretable neural network that can discover multivariate LP equations which describe a dataset²²2Source code available in https://github.com/nisalr/ginn-lp. Once trained on a dataset of input-output pairs, the network can produce a LP that fits the data, while accurately predicting whether the LP assumption is valid for the given dataset.

The main component of the proposed architecture is the “power-term approximator” (PTA) block, which is an interpretable NN block consisting of a logarithmic unit and a single linear activated neuron followed by an exponential activation. We illustrate the architecture of a PTA in Figure 1. If $x_{1},x_{2},...,x_{n}$ denote the inputs to the block, the output $y_{d}$ of the linear activated neuron and the output of the network $y$ are given by,

$\displaystyle y_{d}$	$\displaystyle=w_{1}\log(x_{1})+w_{2}\log(x_{2})+...+w_{n}\log(x_{n})$	(2)
	$\displaystyle=\log(x_{1}^{w_{1}}x_{2}^{w_{2}}...x_{n}^{w_{n}})$	(3)
$\displaystyle y$	$\displaystyle=e^{y_{d}}=x_{1}^{w_{1}}x_{2}^{w_{2}}...x_{n}^{w_{n}}$	(4)

where $w_{1},w_{2},...,w_{n}$ are the weights of the linear activated neuron.

Since the PTA can be generalized to support any arbitrary number of inputs, a single block can in theory approximate an arbitrary multivariate LP with a single term.

To enable the network to discover equations with multiple terms, a set of PTA blocks are stacked in parallel, followed by a single neuron with a linear activation as shown in Figure 2. The output is a linear combination of the outputs of all PTA blocks. Each PTA is expected to discover a single term in a polynomial equation, including linear terms. More generally, they can recover equations with multiple additive terms each having an arbitrary number of variables raised to arbitrary powers. To promote the learning of exact equations, we perform a rounding of all coefficients and exponents to the nearest $\epsilon$ after training. $\epsilon$ was empirically set to 0.001 in our experiments.

Training Strategies

In our implementation, GINN-LP is trained to minimize the mean squared error (MSE) using the Adam optimizer and an exponentially decaying learning rate starting from 0.01. Algorithm 1 outlines the training strategies used to train GINN-LP efficiently, along with the empirically determined hyperparameters. We further discuss these strategies below.

Regularization to enforce equation conciseness. In practice, simpler equations are preferred over more complex ones to describe relationships between variables since they are easier to understand and interpret. To enforce this property, we impose sparsity-promoting regularization on our interpretable neural network. Since the values of the weight matrix correspond to the coefficients and power terms of the discovered equation, a sparser weight matrix would result in a simpler equation.

However, using regularization throughout training from beginning to end causes some discovered equations to be slightly different to the ground-truth equation. To avoid this, we introduce an unregularized training phase after regularized training. This allows the equation coefficients to converge towards more accurate values.

During the regularized training phase, a linear combination of $L_{1}$ and $L_{2}$ regularization is applied. Hence, the objective function would be,

\displaystyle L=\frac{1}{N}\sum_{i=1}^{N}(f(\textbf{x})-y_{i})^{2}+\lambda_{1}\sum_{i=1}^{n}|W_{i}|+\lambda_{2}\sum_{i=1}^{n}W_{i}^{2}

(5)

Here, N is the number of samples in the dataset, n is the number of weights in the weight matrix and $\lambda_{1}$ , $\lambda_{2}$ are the regularization constants. When the unregularized phase was not included, we observed that the equation coefficients would sometimes deviate from their true values by small amounts (e.g. 0.001). The point at which the regularization changes is expressed as a fraction of the total number of training epochs. This was empirically set to 0.5.

Neural network growth. Having a higher number of PTA blocks than required could lead to the model overfitting. Therefore, starting from a single PTA block, the number of PTA blocks is increased, thereby growing the GINN-LP architecture. When the network is grown, a new, randomly initialized PTA block is added in parallel without altering the weights of the already trained PTA blocks. Then, the grown network is trained for a fixed number of epochs on the dataset. This is performed iteratively till the network is grown to a pre-defined maximum size, or till an early stopping condition is reached.

Training is stopped if the validation MSE does not decrease by a certain percentage after each growth iteration. This percentage is a hyperparameter of the model and is empirically set to 20% in our experiments. We expect the training to stop when the number of PTA blocks is equal to the number of terms in the equation.

Training multiple instances. When training GINN-LP, the network sometimes gets stuck in local optimums. To ensure a well-fitting model, we train the network multiple times with different random initializations, and the best model is selected as outlined in the following sub-section.

Model selection. When there are multiple models (equations) with similar performances on unseen data, we prefer the simpler model. To select the best model from all training instances, we propose a new performance metric named the symbolic error (SE) using a linear combination of the model complexity, and the validation set MSE. We use the same complexity metric as SRBench (La Cava et al. 2021), where the complexity ( $C_{M}$ ) of a given model $M$ is defined simply as the number of mathematical operators ( $\mathrm{op}_{M}$ ), constants ( $\mathrm{const}_{M}$ ) and features ( $n_{m}$ ) in the equation discovered by the model. The best model $\hat{M}$ is determined as shown below,

$\displaystyle C_{M}$	$\displaystyle=\mathrm{op}_{M}+\mathrm{const}_{M}+n_{M}$	(6)
$\displaystyle SE_{M}$	$\displaystyle=MSE(y,\hat{y}\|\theta_{M})+\alpha*C_{M}$	(7)
$\displaystyle\hat{M}$	$\displaystyle=\operatorname*{argmin}_{M}[SE_{M}]$	(8)

Here, $\theta_{M}$ are the parameters of model $M$ , $y$ is the target, $\hat{y}$ is the predicted value and $\alpha$ is the weighting factor of the model complexity. $\alpha=10^{-6}$ is determined empirically.

Ensemble Method for Discovering Non-Laurent Polynomial Equations

To demonstrate the broader applicability of GINN-LP, we propose an ensemble equation discovery pipeline that combines GINN-LP with high-performing SR methods, to enable the discovery of both LP and non-LP equations, thereby achieving state-of-the-art performance in equation discovery. In this ensemble method, we first fit GINN-LP, and if at least one of the exponents of the discovered equation is not an integer, we determine that the assumption of the ground-truth equation being an LP is invalid. In this case, we fit a secondary SR method to discover an accurate equation. The ensemble pipeline is illustrated in Figure 3 and two hypothetical examples showing the GINN-LP and secondary SR method outputs are given in Table 2.

We perform multiple experiments with different options for the secondary SR method, and show that ensembling with GINN-LP improves the performance of all other SR methods used for evaluation.

GINN-LP Output	LP/Non-LP?	Secondary SR output	Ensemble output
$x_{1}^{2.342}x_{2}^{2}$	Non-LP	$sin(x_{1})/x_{2}$	$sin(x_{1})/x_{2}$
$x_{1}^{2}x_{2}^{-1}$	LP	N/A	$x_{1}^{2}x_{2}^{-1}$

Table 2: Hypothetical examples of the ensemble method output, showing the intermediate outputs.

Results

We evaluate our approach on the Feynman symbolic regression benchmark dataset, using SRBench (La Cava et al. 2021). The Feynman dataset consists of multiple SR datasets generated from real-world equations, along with their corresponding ground-truth equations. We note that our method may not qualify as a symbolic regression method, since we are not explicitly searching through a space of equations. However, due to the similarities in the aim, we evaluate our method against SR methods.

GINN-LP was compared against 14 other popular SR methods in terms of how well it recovers the ground-truth equation. Each SR method was run for five trials per equation dataset to investigate whether they can consistently recover the correct ground truth equation with different random initializations.

All experiments were conducted on a computing cluster where each experiment used a 32-core, 2.90GHz Intel Xeon Gold 6326 CPU and a single NVIDIA A-100 GPU. We note that our approach does not see a major performance improvement when training on a GPU, since the network architecture is small. Moreover, we did not re-run experiments for other methods where results were already available within SRBench ³³3SRBench code available in https://github.com/cavalab/srbench.

Performance Metric

We use the same definition of exact recovery as SRBench (La Cava et al. 2021). Each recovered equation is simplified using SymPy (Meurer et al. 2017) and compared with the corresponding ground-truth equation to determine whether it has been recovered exactly. For each SR method tested, the performance is reported using the symbolic solution recovered percentage, also called the solution rate. This is calculated as,

\displaystyle solution\ rate=\frac{correctly\ recovered\ equations}{total\ number\ of\ equations}\times 100\%

(9)

Since multiple trials (with different random initializations) are performed for each dataset and algorithm, we report a range of the solution rate or the mean/median value. In each trial, the network was trained on 10,000 data points, generated from the ground-truth equation of each of the datasets.

Standalone GINN-LP Performance Results – Recovering LP Equations

We first selected the subset of 48 multivariate LPs within the Feynman dataset to evaluate our method, as GINN-LP is designed to retrieve such equations. The ground-truth equations of these datasets are listed in the appendix.

Comparison of models without added noise. We compare GINN-LP with 14 popular symbolic regression algorithms using SRBench. The results of the experiments without adding any noise to the datasets are shown in Figure 4 for the five top-performing algorithms. Our method outperforms all the 14 algorithms it was compared against in terms of minimum, maximum and median solution rates. To ensure fairer comparison, experiments were performed by reducing the search space of AIFeynman only to include operators present in LPs, to incorporate the same prior that we are incorporating into our method. However, the performance of AIFeynman did not increase noticeably with this reduced search space.

Impact of adding Gaussian noise. To evaluate the noise-resilience of GINN-LP, experiments were conducted after adding Gaussian white noise to the target as a fraction of the target root mean square value. Experiments were repeated with noise fractions of 0.001, 0.01 and 0.1, on the top five algorithms identified in the previous section (without adding noise) and the results are shown in Figure 5(a). Our method outperforms the other top methods with low noise (0.001) but does not manage to beat the AFP_FE method with high noise values.

Equations that were recovered by our method in the high noise cases, seemed to have recovered the structure with some slight changes to the constants. We hypothesize that the network overfits due to attempting to model the noise.

However, we note GINN-LP is still able to model noisy data quite accurately, even if it’s unable to recover the exact groundtruth equations, as shown in the appendix.

Solution rate vs ground truth equation complexity. Further experiments were conducted to observe how the GINN-LP performance changes with the complexity of the equation we attempt to recover. Here, complexity $C_{M}$ is calculated using the metric defined in Equation 6. The equations in our dataset consisted of complexity values ranging from 3 to 27. To visualize our results, we grouped the complexity values into six bins. The intervals of these bins are $(0,3],(3,6],(6,9],(9,12],(12,15]$ and $(15,+\infty)$ .

We compare the performance of the five top-performing algorithms on each of the six bins. The mean solution rate (across all experiments) for each bin for a particular algorithm is visualized in Figure 5(b). Our method maintains good solution rates across most of the complexity bins, with a moderate solution rate of close to 40% even for the most complex bin, where all other methods fail.

Training with irrelevant inputs. In practice, we might not know which of our input variables are related to the output. We investigate how GINN-LP performs when there are such variables, by adding irrelevant inputs sampled from a uniform distribution within the same range as the other inputs. Experiments were conducted with up to 2 irrelevant inputs and the results are shown in Figure 6(a). We compare our method with AIFeynman, which was the top-performing method (among other methods) in earlier experiments. Our method does not see a significant drop in performance even with 2 irrelevant inputs, while the performance of AIFeynman drops considerably.

Effect of training data size. We compared our method with AIFeynman when trained with different sizes of training datasets and present the results in Figure 6(b). We note that GINN-LP can still recover a majority of the equations even with a small number of samples, and remains competitive with AIFeynman.

Effect of epoch count and number of training instances. Further experiments were done to investigate the effect of these hyperparameters on the performance of GINN-LP. Results of these experiments are provided in the appendix.

Ensemble Method Performance Results

We evaluate the performance of the proposed ensemble method using the entire Feynman symbolic regression benchmark dataset within SRBench. This consists of 113 datasets, including all 48 datasets with LP ground-truths. GINN-LP perfectly identifies when the LP assumption does not hold for a given dataset, as demonstrated by the confusion matrix included in the appendix. This indicates that GINN-LP can successfully identify the problems it cannot solve, enabling them to be solved using a secondary, high-performing SR method. We show the results after ensembling our method with AIFeynman and Multiple Regression Genetic Programming (MRGP) in the below sub-sections, since these two methods show the best performances in terms of their solution rate and predictive accuracy respectively. We further show in the appendix, that ensembling with GINN-LP enhances the performance of all 14 other SR methods.

Solution rate. The results of the proposed ensemble methods in terms of their solution rates are shown in Figure 7. We show that our method ensembled with AIFeynman outperforms state-of-the-art methods by 7.1% in terms of the median solution rate (absolute). Moreover, despite being formulated for recovering LP equations, the standalone GINN-LP still outperforms the majority of other methods on the complete Feynman dataset.

Predictive accuracy. Further experiments were conducted to evaluate how accurately our methods can predict outputs for new, previously unseen data. The prediction accuracy is calculated as the percentage of datasets with $R^{2}$ above 0.99 on the test data. A train-test split of 75-25 is used. We compare the prediction accuracy of all methods on all 113 datasets, including GINN-LP, and the proposed ensemble methods. As shown in Figure 8, all of our methods, including the standalone GINN-LP perform reasonably well while GINN-LP ensembled with MRGP outperforms all other methods by 0.9%, while achieving near perfect $R_{2}>0.99$ accuracy.

Conclusion

In this work, we present GINN-LP, an end-to-end differentiable interpretable neural network that can recover mathematical expressions that take the form of multivariate LPs. This is made possible by a new type of interpretable neural network block we introduce, named the PTA block, that can discover terms of multivariate LPs. Our method can automatically discover the number of terms in the underlying equation using a neural network growth strategy while promoting the discovery of simpler equations by applying sparsity regularization.

The results of the experiments conducted show that our approach achieves the state-of-the-art results, outperforming the best contender in symbolic regression algorithms on datasets with multivariate LP ground-truth equations in the Feynman symbolic regression benchmark dataset. We also show that GINN-LP can successfully identify when a given dataset cannot be accurately described using a multivariate LP. Building on these observations, we propose an ensemble method which uses a secondary SR method to discover equations when GINN-LP identifies that the LP assumption does not hold for a given dataset. Further experiments show that this ensemble method achieves state-of-the-art results in equation discovery, when evaluated on the entire Feynman dataset, including both LP and non-LP equations.

A limitation of the proposed architecture is that it assumes that all input values inside and outside of the training dataset are positive real numbers. This is because of the natural logarithmic activation used within the PTA block. This could be overcome by adding a constant to all input features with negative values. However, this will make it more difficult to discover the mathematical equation since adding a constant to the input could change an LP equation to a more complex equation (e.g. $x_{1}/x_{2}$ changes to $(x_{1}+2)/(x_{2}+2)$ ). Future work may investigate NN architectures that can perform power approximation in the presence of negative input values.

Another interesting future research direction is to extend this architecture to discover a wider variety of equations, without being limited to only LPs. One way to achieve this is to grow the network sequentially, thereby stacking PTA blocks in multiple layers one after the other. However, this would significantly increase the risk of overfitting. Therefore, greater attention will need to be paid to regularizing the weights, in addition to the growth strategy.

Acknowledgements

This study was supported by the Melbourne Graduate Research Scholarship for the first author, the Australian Research Council grant DP210101135 for the first author and the Australian Research Council grant DP220101035 for the second, third and the last authors. This study was supported partially for the third author by the Australian Government through the Australian Research Council’s Centre of Excellence for Children and Families over the Life Course (Project ID CE200100025). This research was also supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative. The authors thank Nadarasar Bahavan, Deshani Poddenige and Rashindrie Perera for proofreading.

References

Arnaldo, Krawiec, and O’Reilly (2014) Arnaldo, I.; Krawiec, K.; and O’Reilly, U.-M. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, 879–886. New York, NY, USA: ACM. ISBN 9781450326629.
Biggio et al. (2021) Biggio, L.; Bendinelli, T.; Neitz, A.; Lucchi, A.; and Parascandolo, G. 2021. Neural Symbolic Regression that Scales. In International Conference on Machine Learning.
Brunton, Proctor, and Kutz (2015) Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2015. Discovering governing equations from data: Sparse identification of nonlinear dynamical systems.
Brunton, Proctor, and Kutz (2016) Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2016. Sparse Identification of Nonlinear Dynamics with Control (SINDYc). In IFAC-PapersOnLine, volume 49, 710–715. Elsevier B.V.
de Franca and Aldeia (2021) de Franca, F. O.; and Aldeia, G. S. 2021. Interaction–transformation evolutionary algorithm for symbolic regression. Evolutionary Computation, 29(3): 367–390.
Dubčáková (2011) Dubčáková, R. 2011. Eureqa: software review. Genetic Programming and Evolvable Machines, 12(2): 173–178.
Feynman, Leighton, and Sands (2011) Feynman, R. P.; Leighton, R. B.; and Sands, M. 2011. The Feynman lectures on physics, Vol. I: The new millennium edition: mainly mechanics, radiation, and heat. Basic Books.
Fletcher (2000) Fletcher, R. 2000. Practical Methods of Optimization. Chichester, West Sussex England: John Wiley & Sons, Ltd. ISBN 9781118723203.
Jin et al. (2019) Jin, Y.; Fu, W.; Kang, J.; Guo, J.; and Guo, J. 2019. Bayesian Symbolic Regression.
John R. Koza (1992) John R. Koza. 1992. Genetic programming - on the programming of computers by means of natural selection. MIT Press. ISBN 0-262-11170-5.
Kaheman, Kutz, and Brunton (2020) Kaheman, K.; Kutz, J. N.; and Brunton, S. L. 2020. SINDy-PI: A robust algorithm for parallel implicit sparse identification of nonlinear dynamics: SINDy-PI. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2242).
Kamienny et al. (2022) Kamienny, P.-A.; d’Ascoli, S.; Lample, G.; and Charton, F. 2022. End-to-end symbolic regression with transformers. In Advances in Neural Information Processing Systems.
Kim et al. (2021) Kim, S.; Lu, P. Y.; Mukherjee, S.; Gilbert, M.; Jing, L.; Ceperic, V.; and Soljacic, M. 2021. Integration of Neural Network-Based Symbolic Regression in Deep Learning for Scientific Discovery. IEEE Transactions on Neural Networks and Learning Systems, 32(9): 4166–4177.
Kommenda et al. (2020) Kommenda, M.; Burlacu, B.; Kronberger, G.; and Affenzeller, M. 2020. Parameter identification for symbolic regression using nonlinear least squares. Genetic Programming and Evolvable Machines, 21(3): 471–501.
Koza (1994) Koza, J. 1994. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2).
La Cava et al. (2019a) La Cava, W.; Helmuth, T.; Spector, L.; and Moore, J. H. 2019a. A Probabilistic and Multi-Objective Analysis of Lexicase Selection and $\epsilon$ -Lexicase Selection. Evolutionary Computation, 27(3): 377–402.
La Cava et al. (2021) La Cava, W.; Orzechowski, P.; Burlacu, B.; de França, F. O.; Virgolin, M.; Jin, Y.; Kommenda, M.; and Moore, J. H. 2021. Contemporary Symbolic Regression Methods and their Relative Performance. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
La Cava et al. (2019b) La Cava, W.; Raj Singh, T.; Taggart, J. P.; Suri, S.; and Moore, J. H. 2019b. Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations.
Long, Lu, and Dong (2018) Long, Z.; Lu, Y.; and Dong, B. 2018. PDE-Net 2.0: Learning PDEs from Data with A Numeric-Symbolic Hybrid Deep Network.
Martius and Lampert (2016) Martius, G.; and Lampert, C. H. 2016. Extrapolation and learning equations.
McConaghy (2011) McConaghy, T. 2011. FFX: Fast, Scalable, Deterministic Symbolic Regression Technology. 235–260.
Messenger and Bortz (2021) Messenger, D. A.; and Bortz, D. M. 2021. Learning Mean-Field Equations from Particle Data Using WSINDy.
Meurer et al. (2017) Meurer, A.; Smith, C. P.; Paprocki, M.; Čertík, O.; Kirpichev, S. B.; Rocklin, M.; Kumar, A.; Ivanov, S.; Moore, J. K.; Singh, S.; Rathnayake, T.; Vig, S.; Granger, B. E.; Muller, R. P.; Bonazzi, F.; Gupta, H.; Vats, S.; Johansson, F.; Pedregosa, F.; Curry, M. J.; Terrel, A. R.; Roucka, S.; Saboo, A.; Fernando, I.; Kulal, S.; Cimrman, R.; and Scopatz, A. 2017. SymPy: symbolic computing in Python. PeerJ Computer Science, 3: e103.
Olson et al. (2017) Olson, R. S.; La Cava, W.; Orzechowski, P.; Urbanowicz, R. J.; and Moore, J. H. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1): 36.
Petersen et al. (2021) Petersen, B. K.; Larma, M. L.; Mundhenk, T. N.; Santiago, C. P.; Kim, S. K.; and Kim, J. T. 2021. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations.
Quade et al. (2018) Quade, M.; Abel, M.; Nathan Kutz, J.; and Brunton, S. L. 2018. Sparse identification of nonlinear dynamics for rapid model recovery. Chaos, 28(6).
Rudy et al. (2017) Rudy, S. H.; Brunton, S. L.; Proctor, J. L.; and Kutz, J. N. 2017. Data-driven discovery of partial differential equations. Science Advances, 3(4).
Sahoo, Lampert, and Martius (2018) Sahoo, S. S.; Lampert, C. H.; and Martius, G. 2018. Learning Equations for Extrapolation and Control.
Schmidt and Lipson (2009) Schmidt, M.; and Lipson, H. 2009. Distilling Free-Form Natural Laws from Experimental Data. Technical Report 5923.
Schmidt and Lipson (2010) Schmidt, M. D.; and Lipson, H. 2010. Age-fitness pareto optimization. In Proceedings of the 12th annual conference on Genetic and evolutionary computation - GECCO ’10, 543. New York, New York, USA: ACM Press. ISBN 9781450300728.
Tibshirani (1996) Tibshirani, R. 1996. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1): 267–288.
Udrescu et al. (2020) Udrescu, S.-M.; Tan, A.; Feng, J.; Neto, O.; Wu, T.; and Tegmark, M. 2020. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity.
Udrescu and Tegmark (2020) Udrescu, S.-M.; and Tegmark, M. 2020. AI Feynman: A physics-inspired method for symbolic regression. Technical report.
Valipour et al. (2021) Valipour, M.; You, B.; Panju, M.; and Ghodsi, A. 2021. SymbolicGPT: A Generative Transformer Model for Symbolic Regression.
Vidanaarachchi et al. (2020) Vidanaarachchi, R.; Shaw, M.; Tang, S.-L.; and Halgamuge, S. 2020. IMPARO: inferring microbial interactions through parameter optimisation. BMC Molecular and Cell Biology, 21(S1): 34.
Virgolin, Alderliesten, and Bosman (2019) Virgolin, M.; Alderliesten, T.; and Bosman, P. A. N. 2019. Linear scaling with and within semantic backpropagation-based genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference, 1084–1092. New York, NY, USA: ACM. ISBN 9781450361118.
Virgolin et al. (2021) Virgolin, M.; Alderliesten, T.; Witteveen, C.; and Bosman, P. A. 2021. Improving Model-Based Genetic Programming for Symbolic Regression of Small Expressions. Evolutionary computation, 29(2): 211–237.
Zhang and Lin (2018) Zhang, S.; and Lin, G. 2018. Robust data-driven discovery of governing physical laws with error bars. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2217).

Appendix

Appendix a Exponential search space of polynomial equations

We demonstrate the intractable search space of multivariate polynomial structure by calculating the number of terms and possible structures for polynomials of a given order (n) and a given number of variables (k).

The number of terms $T$ can be calculated by,

T=\binom{n+k}{n}

(10)

The number of unique structures (S) of a polynomial with T terms is given by the number of unique subsets of these T terms. S is given by,

S=2^{T}

(11)

We illustrate this in Figure 9 below.

Appendix b Identifying whether the ground-truth can be described using a Laurent polynomial

GINN-LP can accurately identify whether a given dataset can be described accurately using a LP. We train GINN-LP on all 113 datasets of the Feynman symbolic regression benchmark, within SRBench and classify them as LP or Non-LP. This is done by analyzing the exponents of the discovered equation. If all exponents are integers, we classify an equation as a LP. We compare the GINN-LP classification with the ground-truth equations and plot the confusion matrix in Figure 10. It was observed that there were zero False positives, demonstrating that when the LP assumption is invalid (the ground-truth is non-LP), GINN-LP can perfectly identify it.

Appendix c Laurent polynomial equations used for evaluating the model

The equations within SRBench (La Cava et al. 2021) used to evaluate our method were taken from the Feynman symbolic regression benchmark dataset, which originates from the Feynman lecture series (Feynman, Leighton, and Sands 2011). Since our network is designed to discover multivariate LPs, we used the subset of LP equations from the Feynman dataset.

The 48 equations used to evaluate our method are given below.

1.

$E_{f}=\frac{1}{4\pi}*\frac{q_{1}q_{2}}{\epsilon r^{2}}$
2.

$U=gmz$
3.

$\frac{3}{8\pi G}(H_{G}^{2}+\frac{c^{2}k_{f}}{r^{2}})$
4.

$P=\frac{A\kappa(-T_{1}+T_{2})}{d}$
5.

$K=\frac{m}{2}(u^{2}+v^{2}+w^{2})$
6.

$A=x_{1}y_{1}+x_{2}y_{2}+x_{3}y_{3}$
7.

$E_{den}=E_{f}^{2}\epsilon$
8.

$m=\frac{({\frac{h}{2\pi}})^{2}}{E_{n}d^{2}}$
9.

$F=\frac{AYX}{d}$
10.

$U=Gm_{1}m_{2}(\frac{1}{r_{2}}-\frac{1}{r_{1}})$
11.

$\mu_{m}=\frac{qh}{4\pi m}$
12.

$F=\mu N_{n}$
13.

$F=\frac{q_{1}q_{2}}{4\pi\epsilon r^{2}}$
14.

$F=q_{2}E_{f}$
15.

$U=\frac{1}{2}k_{spring}x^{2}$
16.

$E_{n}=\frac{1}{2}(\omega^{2}+\omega_{0}^{2})x^{2}$
17.

$V_{e}=\frac{q}{c}$
18.

$k=\frac{\omega}{c}$
19.

$P=\frac{q^{2}a^{2}}{6\pi\epsilon c^{3}}$
20.

$E_{n}=\frac{h}{2\pi}\omega$
21.

$\omega=\frac{qvB}{p}$
22.

$r\frac{4\pi\epsilon h^{2}}{mq^{2}}$
23.

$\frac{3}{2}p_{r}V$
24.

$P_{F}=\frac{nk_{b}T}{V}$
25.

$v=\frac{\mu_{drift}qV_{e}}{d}$
26.

$D=\mu_{e}k_{b}T$
27.

$P_{*}=\frac{n_{\rho}p_{d}^{2}E_{f}}{3k_{b}T}$
28.

$B=\frac{1}{4\pi{\epsilon}c^{2}}\frac{2l}{r}$
29.

$F_{e}=\epsilon cE_{f}^{2}$
30.

$F_{E}=\frac{P}{4\pi r^{2}}$
31.

$\omega=\frac{g_{\_}qB}{2m}$
32.

$E=\frac{g_{\_}\mu_{M}BJ_{z}}{h}$
33.

$I=\frac{qv}{2\pi r}$
34.

$\mu_{M}=\frac{qvr}{2}$
35.

$f=\frac{\mu_{m}B}{k_{b}T}+\frac{\mu_{m}\alpha}{\epsilon c^{2}k_{b}T}$
36.

$E=\mu_{M}(1+\chi)B$
37.

$V_{e}=\frac{q}{4\pi\epsilon r}$
38.

$E_{den}=\frac{\epsilon E_{f}^{2}}{2}$
39.

$E=\frac{3}{5}\frac{q^{2}}{4\pi\epsilon d}$
40.

$L=\frac{nh}{2\pi}$
41.

$v=\frac{4\pi E_{n}d^{2}k}{h}$
42.

$k=\frac{2\pi\alpha}{nd}$
43.

$E=\frac{-mq^{4}}{2(4\pi\epsilon)^{2}h^{2}}\frac{1}{n^{2}}$
44.

$j=\frac{-\rho_{c_{0}}qA_{vec}}{m}$
45.

$\omega=\frac{4\pi\mu_{M}B}{h}$
46.

$E_{n}=\frac{1}{2m}(p^{2}+m^{2}\omega^{2}x^{2}(1+\frac{\alpha x}{y}))$
47.

$Pr=\frac{-1}{8\pi G}(\frac{c^{4}k_{f}}{r^{2}}+H_{G}^{2}c^{2}(1-2\alpha))$
48.

$P=-\frac{32}{5}\frac{G^{4}}{c^{5}}\frac{(m_{1}m_{2})^{2}(m_{1}+m_{2})}{r^{5}}$

Appendix d Contemporary symbolic regression algorithms compared with the proposed interpretable neural network (GINN-LP)

Using SRBench, we compare our method with 14 symbolic regression algorithms seen in contemporary literature. These methods are listed below in Table 3.

Table 3: List of algorithms compared against GINN-LP

Description	Method name	Year	Method family
Age fitness pareto optimization	AFP (Schmidt and Lipson 2010)	2011	Genetic programming (GP)
Age fitness pareto optimization with fitness estimates	AFP-FE (Schmidt and Lipson 2009)	2011	Genetic programming (GP)
AIFeynman	AIFeynman (Udrescu and Tegmark 2020)	2020	Divide and conquer / Neural networks
Bayesian symbolic regression	BSR (Jin et al. 2019)	2020	Bayesian methods / Markov chain Monte Carlo
Deep symbolic regression	DSR (Petersen et al. 2021)	2021	Deep learning
Epsilon lexicase selection	EPLEX (La Cava et al. 2019a)	2016	Genetic programming (GP)
Feature engineering automation tool	FEAT (La Cava et al. 2019b)	2019	Genetic programming (GP)
Fast function extraction	FFX (McConaghy 2011)	2011	Random search
GP version of the gene-pool optimal mixing evolutionary algorithm	GP-GOMEA (Virgolin et al. 2021)	2020	Genetic programming (GP)
GP for symbolic regression in Python	gplearn	2015	Genetic programming (GP)
Interaction transformation evolutionary algorithm	ITEA (de Franca and Aldeia 2021)	2020	Genetic programming (GP)
Multiple regression genetic programming	MRGP (Arnaldo, Krawiec, and O’Reilly 2014)	2014	Genetic programming (GP)
SR with non-linear least squares	Operon (Kommenda et al. 2020)	2019	Genetic programming (GP)
Semantic back-propagation genetic programming	SBP-GP (Virgolin, Alderliesten, and Bosman 2019)	2019	Genetic programming (GP)

Appendix e Comparison of SR methods in terms of $R^{2}>0.99$ accuracy, when a high amount of noise if present

When a high amount of noise is present, GINN-LP solution rate drops as shown earlier in the paper. However, GINN-LP still performs very well in modelling the data, even though it may not recover the exact groundtruth equation. Further analysis of the model outputs show that GINN-LP achieves a $R_{2}>0.99$ accuracy of 100%, even with the highest noise ratio of 0.1, as shown in Figure 11.

Appendix f Performance comparison of GINN-LP, ensembled with all other methods

We perform experiments after ensembling GINN-LP with all 14 contemporary methods used for comparison. Performance was evaluated using solution rate and $R_{2}>0.99$ accuracy as defined earlier. Almost all methods improve upon ensembling with GINN-LP as shown in Table 4.

Table 4: Performance of SR methods after ensembling with GINN-LP

SR Method	solution rate - standalone	solution rate - ensembled with GINN-LP	solution rate improvement	$R_{2}>0.99$ accuracy - standalone	$R_{2}>0.99$ accuracy - Ensembled with GINN-LP	$R_{2}>0.99$ accuracy improvement
AFP	21.2%	42.5%	21.2%	58.4%	67.3%	8.8%
AFP_FE	27.4%	45.1%	17.7%	74.3%	78.8%	4.4%
AIFeynman	54.9%	61.9%	7.1%	85.8%	87.6%	1.8%
BSR	2.2%	38.1%	35.8%	22.6%	49.6%	27.0%
DSR	23.0%	46.9%	23.9%	40.7%	57.5%	16.8%
EPLEX	13.3%	40.7%	27.4%	80.1%	81.4%	1.3%
FEAT	0.0%	38.1%	38.1%	54.4%	69.9%	15.5%
FFX	0.0%	38.1%	38.1%	42.5%	63.7%	21.2%
GP-GOMEA	27.0%	46.9%	19.9%	90.7%	92.0%	1.3%
ITEA	23.0%	39.8%	16.8%	40.7%	57.5%	16.8%
MRGP	0.0%	38.1%	38.1%	98.2%	99.1%	0.9%
Operon	17.7%	38.9%	21.2%	95.1%	94.7%	-0.4%
SBP-GP	12.4%	42.5%	30.1%	94.7%	95.6%	0.9%
gplearn	16.4%	42.5%	26.1%	52.7%	65.5%	12.8%

Appendix g Effect of epoch count and number of training instances on GINN-LP performance

Effect of epoch count. At each network growth stage, the network is trained for a pre-determined number of epochs. We train GINN-LP with a varying number of epochs per network growth stage (on all 48 datasets, 5 trials per dataset). As demonstrated in Figure 12(a), training each network growth stage for 500 epochs showed the highest performance and therefore was used for the rest of the experiments.

Impact of the number of training instances. We train multiple instances of GINN-LP and choose the best model as per Equation 9. The performance results for fixed training instance counts between 1 and 5 are shown in Figure 12(b). With more training instances, the likelihood of at least one recovering the correct ground-truth equation rises, increasing the overall solution rate. Due to increased training time with more instances, we opt for four instances, striking a balance between speed and performance.