GSR: A Generalized Symbolic Regression Approach
Abstract
Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method is competitive with strong SR benchmark methods, achieving promising experimental performance on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
1 Introduction
Symbolic regression (SR) aims to find a mathematical expression that best describes the relationship between the independent variables and the target (or dependent) variable based on a given dataset. By inspecting the resulting expression, we may be able to identify nontrivial relations and/or physical laws which can provide more insight into the system represented by the given dataset. SR has gained tremendous interest and attention from researchers over the years for many reasons. First, many rules and laws in natural sciences (e.g. in physical and dynamical systems (schmidt2009distilling; quade2016prediction)) are accurately represented by simple analytical equations (which can be explicit (brunton2016discovering) or implicit (mangan2016inferring; kaheman2020sindy)). Second, in contrast to neural networks, which involve complex input-output mappings and are hence often treated as black boxes that are difficult to interpret, SR produces concise and interpretable expressions. Finally, symbolic equations may outperform neural networks in out-of-distribution generalization (especially for physical problems) (cranmer2020discovering).
SR does not require a priori specification of a model. Conventional regression methods such as least squares (wild1989nonlinear), likelihood-based (edwards1984likelihood; pawitan2001all; tohme2021improving), and Bayesian regression techniques (lee1997bayesian; leonard2001bayesian; tohme2020bayesian; tohme2020generalized; vanslette2020general) use fixed-form parametric models and optimize only over the model parameters. In contrast, SR seeks to find both the model structure and its associated parameters simultaneously.
Related Work. The SR problem has been widely studied in the literature (orzechowski2018we; la2021contemporary). SR can be a very challenging problem and is thought to be NP-hard (lu2016using; udrescu2020ai; petersen2021deep; virgolin2022symbolic). It can also be computationally expensive, as the search space is very wide (or complex), containing expressions of any size and length (de2018greedy); this issue is exacerbated as the dimension of the input feature vector (i.e. the number of independent variables) grows. Several approaches have been suggested over the years. Most of the methods use genetic (or evolutionary) algorithms (koza1992genetic; schmidt2009distilling; back2018evolutionary; virgolin2019linear). Some more recent methods are Bayesian in nature (jin2019bayesian), some are physics-inspired (udrescu2020ai), and others use divide-and-conquer (luo2017divide) and block building algorithms (chen2017fast; chen2018block; chen2018multilevel). More recently, researchers have proposed using machine learning algorithms and neural networks to solve the SR problem (martius2016extrapolation; sahoo2018learning; udrescu2020ai2; AhnKLS20; al2020universal; kim2020integration; kommenda2020parameter; operonc++; biggio2021neural; mundhenk2021symbolic; petersen2021deep; valipour2021symbolicgpt; razavi2022neural; zhang2022ps; d2022deep; kamienny2022end; zhang2022deep). Furthermore, some works suggested constraining the search space of functions to a generalized linear space (nelder1972generalized) (e.g. Fast Function eXtraction (mcconaghy2011ffx), Elite Bases Regression (chen2017elite), etc.), which proved to accelerate the convergence of genetic algorithms significantly (at the expense of sometimes losing the generality of the solution (luo2017divide)).
Most SR methods use a tree-based implementation, where analytical functions are represented (or encoded) by expression trees. Some approaches suggested encoding functions as integer strings (o2001grammatical), while others proposed representing them using matrices (luo2012parse; chen2017elite; de2018greedy; de2020interaction). As we will discuss in later sections, our implementation relies on matrices to encode expressions.
Our Contribution. We present Generalized Symbolic Regression (GSR), by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we identify mathematical relationships between the independent variables (or features) and some transformation of the target variable. In other words, we learn the mapping from the feature space to a transformed target space (where the transformation applied to the target variable is also learned during this process). To find the appropriate functions (or transformations) to be applied to the features as well as to the targets, we constrain our search space to a weighted sum of basis functions. In contrast to conventional tree-based genetic programming approaches, we propose a matrix-based encoding scheme to represent the basis functions (and hence the full mathematical expressions). We run a series of numerical experiments on the well-known SR benchmark datasets and show that our proposed method is competitive with many strong SR methods. Finally, we introduce SymSet, a new SR benchmark problem set that is more challenging than existing benchmarks.
2 Notation and Problem Formulation
Consider the following regression task. We are given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ consisting of $N$ i.i.d. paired examples, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the $d$-dimensional input feature vector and $y_i \in \mathbb{R}$ represents the corresponding continuous target variable. The goal of SR is to search the space $\mathcal{F}$ of all possible mathematical expressions defined by a set of given mathematical functions (e.g., $\sin$, $\cos$, $\exp$, $\log$) and arithmetic operations (e.g., $+$, $-$, $\times$, $\div$), by solving the following optimization problem:

$$f^* = \arg\min_{f \in \mathcal{F}} \; \sum_{i=1}^{N} \big( f(\mathbf{x}_i) - y_i \big)^2 \qquad (1)$$

where $f \in \mathcal{F}$ is the model function and $f^*$ is the optimal model.
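For concreteness, the following minimal sketch evaluates this objective for a single candidate expression on a toy dataset; the dataset, the candidate expression, and all names are illustrative assumptions, not part of any particular SR method.

```python
import numpy as np

# Toy dataset (illustrative only): 100 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(100, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2           # hypothetical ground truth

def candidate(X):
    """One candidate expression f(x) drawn from the search space."""
    return np.sin(X[:, 0]) + X[:, 1]         # deliberately imperfect

# Sum-of-squared-errors objective of Equation (1) for this candidate.
sse = np.sum((candidate(X) - y) ** 2)
print(f"SSE of the candidate expression: {sse:.3f}")
```

SR methods differ in how they search over candidate expressions; the objective above is only the scoring step.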
3 Generalized Symbolic Regression (GSR)
In this section, we introduce our Generalized Symbolic Regression (GSR) approach. We present its problem formulation, and discuss its solution and implementation.
3.1 Modifying the goal of symbolic regression
As highlighted in Section 2, the goal of SR is to search the function space to find the model that best fits the mapping between the independent variables and the target variable (i.e. the mapping between $\mathbf{x}_i$ and $y_i$, for all $i$). Since the main objective of SR is to recognize correlations and find non-trivial interpretable models (rather than making direct predictions), we modify the goal of SR; we instead search the function space to find the model that best describes the mapping between the independent variables and a transformation of the target variable (i.e. the mapping between $\mathbf{x}_i$ and some transformation or function of $y_i$, for all $i$). Formally, we propose modifying the goal of SR to search for appropriate (model) functions $f$ and $g$ from a space $\mathcal{F}$ of all possible mathematical expressions defined by a set of given mathematical functions (e.g., $\sin$, $\cos$, $\exp$, $\log$) and arithmetic operations (e.g., $+$, $-$, $\times$, $\div$), which can be described by the following optimization problem:
$$(f^*, g^*) = \arg\min_{f, g \in \mathcal{F}} \; \sum_{i=1}^{N} \big( f(\mathbf{x}_i) - g(y_i) \big)^2 \qquad (2)$$

where $f^*$ and $g^*$ are the optimal analytical functions. In other words, instead of searching for mathematical expressions of the form $y = f(\mathbf{x})$ as is usually done in the SR literature, the proposed GSR approach attempts to find expressions of the form $g(y) = f(\mathbf{x})$. We illustrate this concept in Table 1.
Table 1: Examples of ground truth expressions and the corresponding expressions learned by GSR.
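As a hypothetical illustration of the kind of pair such a table contains (this example is ours, not an entry of the original table): a ground truth relation such as $y = \exp\!\big(x_1^2 + \sin(x_2)\big)$ would be recovered by GSR in the form $\log(y) = x_1^2 + \sin(x_2)$, i.e. with $g(y) = \log(y)$ and $f(\mathbf{x}) = x_1^2 + \sin(x_2)$.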
Although the main goal of GSR is to find expressions of the form $g(y) = f(\mathbf{x})$, we may encounter situations where it is best to simply learn expressions of the form $y = f(\mathbf{x})$ (i.e. with $g(y) = y$). For instance, if the ground truth expression is itself of this form, we expect to learn it exactly as is (i.e. with $g(y) = y$), as long as the right basis functions are within the search space, as we will see in the next sections.
Making predictions. Given a new input feature vector $\mathbf{x}^{\text{new}}$, predicting with GSR is simply a matter of solving the equation $g(\hat{y}) = f(\mathbf{x}^{\text{new}})$ for $\hat{y}$, or equivalently, finding a root of $h(\hat{y}) = g(\hat{y}) - f(\mathbf{x}^{\text{new}})$. Note that $f(\mathbf{x}^{\text{new}})$ is a known quantity and $\hat{y}$ is the only unknown. If $g$ is an invertible function, then $\hat{y}$ can be easily found using $\hat{y} = g^{-1}\big(f(\mathbf{x}^{\text{new}})\big)$. If $g$ is not invertible, then $\hat{y}$ will be a root of the function $h$; root-finding algorithms, such as Newton's method, can then be used. In either case, we might end up with multiple solutions for $\hat{y}$ (a function $g$ that is not one-to-one can lead to more than one root). In this case, we choose $\hat{y}$ to be the solution that belongs to the range of $y$, which can be determined from the training dataset.
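As a concrete illustration, the sketch below predicts $\hat{y}$ by root finding for a hypothetical learned relation; the expressions, the bracketing interval, and the use of SciPy's brentq are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical learned expressions (illustrative only).
f = lambda x: x[0] ** 2 + x[1]      # f(x): fully known after training
g = lambda y: np.log(y)             # g(y): invertible here, but treated generically

def predict(x_new, y_range=(1e-6, 1e3)):
    """Solve g(y) = f(x_new) for y, restricted to the range of y seen in training."""
    target = f(x_new)               # known scalar
    h = lambda y: g(y) - target     # the prediction is a root of h
    return brentq(h, *y_range)      # bracketing root finder (assumes a sign change)

x_new = np.array([1.5, 0.5])
print(predict(x_new))               # equals exp(1.5**2 + 0.5) since g is invertible here
```

If $g$ were not invertible, the same call would return a root inside the chosen bracket; the bracket is what restricts the solution to the training range of $y$.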
3.2 A new problem formulation for symbolic regression
Now that we have presented the goal of our proposed GSR approach (summarized by Equation 2), we need to constrain the search space of functions to reduce the computational challenges and accelerate the convergence of our algorithm. Inspired by mcconaghy2011ffx; chen2017elite as well as classical system identification methods (brunton2016discovering), we confine $f$ and $g$ to generalized linear models, i.e. to functions that can be expressed as a linear combination (or as a weighted sum) of basis functions (which can be linear or nonlinear). In mathematical terms, for a given input feature vector $\mathbf{x}$ and a corresponding target variable $y$, the search space is constrained to model functions of the form:
$$f(\mathbf{x}) = \sum_{j=1}^{p} \alpha_j \, \phi_j(\mathbf{x}), \qquad g(y) = \sum_{k=1}^{q} \beta_k \, \psi_k(y) \qquad (3)$$
where $\phi_j$ and $\psi_k$ are the basis functions applied to the feature vector $\mathbf{x}$ and the target variable $y$, respectively, and $p$ and $q$ denote the corresponding numbers of basis functions involved. In matrix form, the minimization problem described in Equation 2 is equivalent to finding the vectors of coefficients $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_p]^\top$ and $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_q]^\top$ such that:
$$\min_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \; \big\| \boldsymbol{\Phi} \boldsymbol{\alpha} - \boldsymbol{\Psi} \boldsymbol{\beta} \big\|_2^2 \qquad (4)$$

where

$$\boldsymbol{\Phi} = \begin{bmatrix} \phi_1(\mathbf{x}_1) & \cdots & \phi_p(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \cdots & \phi_p(\mathbf{x}_N) \end{bmatrix} \in \mathbb{R}^{N \times p}, \qquad \boldsymbol{\Psi} = \begin{bmatrix} \psi_1(y_1) & \cdots & \psi_q(y_1) \\ \vdots & \ddots & \vdots \\ \psi_1(y_N) & \cdots & \psi_q(y_N) \end{bmatrix} \in \mathbb{R}^{N \times q} \qquad (5)$$
Note that if we examine the minimization problem as expressed in Equation 4, we can trivially minimize the objective by simply setting $\boldsymbol{\alpha} = \mathbf{0}$ and $\boldsymbol{\beta} = \mathbf{0}$, which does not lead to a meaningful solution to our GSR problem. In addition, to avoid reaching overly complex mathematical expressions for $f$ and $g$, we are interested in finding sparse solutions for the weight vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, consisting mainly of zeros, which result in simple analytical functions containing only the surviving basis functions (i.e. those whose corresponding weights are nonzero). This is closely related to sparse identification of nonlinear dynamics (SINDy) methods (brunton2016discovering). To this end, we apply $\ell_1$ regularization, also known as Lasso regression (tibshirani1996regression), by adding a penalty on the $\ell_1$ norm of the weight vector (i.e. the sum of the absolute values of its entries), which leads to sparse solutions with few nonzero coefficients. In terms of our GSR method, Lasso regression automatically performs basis function selection from the set of basis functions under consideration.
Putting the pieces together, we reformulate the minimization problem in Equation 4 as a constrained Lasso regression optimization problem defined as
$$\min_{\mathbf{w}} \;\; \frac{1}{2} \big\| \mathbf{A} \mathbf{w} \big\|_2^2 + \lambda \, \| \mathbf{w} \|_1 \qquad (6)$$

$$\text{s.t.} \quad \| \mathbf{w} \|_2 = 1$$

where $\lambda > 0$ is the regularization parameter, and

$$\mathbf{A} = \big[ \, \boldsymbol{\Phi} \;\; -\boldsymbol{\Psi} \, \big] \in \mathbb{R}^{N \times (p+q)}, \qquad \mathbf{w} = \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix} \in \mathbb{R}^{p+q} \qquad (7)$$
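To make this construction concrete, here is a minimal numpy sketch that assembles $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$, $\mathbf{A}$, and $\mathbf{w}$ for a small, assumed choice of basis functions; the bases and the dataset are illustrative, not the ones GSR would select.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(50, 2))        # 50 samples, 2 features
y = np.exp(X[:, 0] + np.sin(X[:, 1]))          # hypothetical ground truth

# Assumed basis functions acting on x (columns of Phi) and on y (columns of Psi).
phi = [lambda X: X[:, 0], lambda X: np.sin(X[:, 1]), lambda X: X[:, 0] * X[:, 1]]
psi = [lambda y: y, lambda y: np.log(y)]

Phi = np.column_stack([p(X) for p in phi])     # N x p
Psi = np.column_stack([q(y) for q in psi])     # N x q
A = np.hstack([Phi, -Psi])                     # N x (p + q): A @ w = Phi @ alpha - Psi @ beta

# A weight vector w = [alpha; beta] that encodes log(y) = x1 + sin(x2) exactly:
w = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
print(np.abs(A @ w).max())                     # ~0 up to floating point error
```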
3.3 Solving the GSR problem
To solve the GSR problem, we first present our approach for solving the constrained Lasso problem in Equation 6, assuming some particular sets of basis functions are given. We then outline our genetic programming (GP) procedure for finding the appropriate (or optimal) sets of these basis functions, before discussing our matrix-based encoding scheme (to represent the basis functions) that we will use in our GP algorithm.
3.3.1 Solving the Lasso optimization problem given particular sets of basis functions
We assume for now that, in addition to the dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, we are also given the sets of basis functions $\{\phi_j\}_{j=1}^{p}$ and $\{\psi_k\}_{k=1}^{q}$ used with the input feature vector $\mathbf{x}_i$ and its corresponding target variable $y_i$, respectively, for all $i$. In other words, we assume for now that the matrix $\mathbf{A}$ in Equation 7 is formed based on particular sets of basis functions (i.e. particular $\{\phi_j\}$ and $\{\psi_k\}$), and we are mainly interested in solving the constrained optimization problem in Equation 6. Applying the alternating direction method of multipliers (ADMM) (boyd2011distributed), the optimization problem in Equation 6 can be written as
$$\min_{\mathbf{w}, \mathbf{z}} \;\; \frac{1}{2} \big\| \mathbf{A} \mathbf{w} \big\|_2^2 + \lambda \, \| \mathbf{z} \|_1 \qquad (8)$$

$$\text{s.t.} \quad \mathbf{w} - \mathbf{z} = \mathbf{0}, \qquad \| \mathbf{w} \|_2 = 1$$

where $\lambda$ is the regularization parameter. The scaled form of ADMM (see boyd2011distributed for details) for this problem is
$$\begin{aligned} \mathbf{w}^{k+1} &= \arg\min_{\| \mathbf{w} \|_2 = 1} \; \frac{1}{2} \| \mathbf{A} \mathbf{w} \|_2^2 + \frac{\rho}{2} \| \mathbf{w} - \mathbf{z}^{k} + \mathbf{u}^{k} \|_2^2 \\ \mathbf{z}^{k+1} &= \arg\min_{\mathbf{z}} \; \lambda \| \mathbf{z} \|_1 + \frac{\rho}{2} \| \mathbf{w}^{k+1} - \mathbf{z} + \mathbf{u}^{k} \|_2^2 \\ \mathbf{u}^{k+1} &= \mathbf{u}^{k} + \mathbf{w}^{k+1} - \mathbf{z}^{k+1} \end{aligned} \qquad (9)$$

where $\mathbf{u}$ is the scaled dual vector, and the $\mathbf{z}$-update admits the closed form

$$\mathbf{z}^{k+1} = S_{\lambda / \rho} \big( \mathbf{w}^{k+1} + \mathbf{u}^{k} \big) \qquad (10)$$

where $\rho > 0$ is the penalty parameter and the soft thresholding operator $S_{\kappa}$ is defined as

$$S_{\kappa}(a) = \operatorname{sign}(a) \, \max\big( |a| - \kappa, \, 0 \big), \quad \text{applied elementwise} \qquad (11)$$
To find the minimizer $\mathbf{w}^{k+1}$ in the first step of the ADMM algorithm above (in Equation 9), we first compute the gradient of the objective function with respect to $\mathbf{w}$, set it to zero, and then normalize the resulting vector solution:

$$\nabla_{\mathbf{w}} \left[ \frac{1}{2} \| \mathbf{A} \mathbf{w} \|_2^2 + \frac{\rho}{2} \| \mathbf{w} - \mathbf{z}^{k} + \mathbf{u}^{k} \|_2^2 \right] = \mathbf{A}^\top \mathbf{A} \mathbf{w} + \rho \big( \mathbf{w} - \mathbf{z}^{k} + \mathbf{u}^{k} \big) = \mathbf{0} \qquad (12)$$

It follows that

$$\mathbf{w}^{k+1} = \frac{\widetilde{\mathbf{w}}}{\| \widetilde{\mathbf{w}} \|_2}, \qquad \text{where} \quad \widetilde{\mathbf{w}} = \rho \, \big( \mathbf{A}^\top \mathbf{A} + \rho \, \mathbf{I} \big)^{-1} \big( \mathbf{z}^{k} - \mathbf{u}^{k} \big) \qquad (13)$$
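The sketch below implements these ADMM updates in numpy under the assumptions stated above (the unit-norm constraint is handled by normalizing the $\mathbf{w}$-update, and the $\mathbf{z}$-update uses soft thresholding); the hyperparameters and the fixed iteration count are illustrative, not the paper's settings.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Elementwise soft thresholding operator S_kappa (Equation 11)."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def constrained_lasso_admm(A, lam=0.1, rho=1.0, n_iter=500):
    """Sketch of ADMM for: min 0.5*||A w||_2^2 + lam*||w||_1  s.t.  ||w||_2 = 1."""
    n = A.shape[1]
    w = np.ones(n) / np.sqrt(n)
    z = w.copy()
    u = np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factor reused in the w-update (Equation 13)
    for _ in range(n_iter):
        w = rho * (M @ (z - u))                    # unconstrained minimizer ...
        w /= np.linalg.norm(w) + 1e-12             # ... normalized onto the unit sphere
        z = soft_threshold(w + u, lam / rho)       # z-update (Equation 10)
        u = u + w - z                              # scaled dual update
    return z                                       # sparse estimate of w = [alpha; beta]

# Example usage with a matrix A built as in the previous sketch:
# w_hat = constrained_lasso_admm(A, lam=0.05)
```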
3.3.2 Finding the appropriate sets of basis functions using genetic programming
Now that we have presented Algorithm 1 that solves the constrained Lasso optimization problem in Equation 6 for particular sets of basis functions, we go through our procedure for finding the optimal sets of basis functions (and hence, the optimal analytical functions $f^*$ and $g^*$).
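Before turning to the encoding, here is a high-level, hypothetical sketch of one generation of this search: each individual is a pair of candidate basis-function sets, its fitness is the objective value returned by the constrained Lasso solver (Algorithm 1), and standard selection, crossover, and mutation operators act on the matrix encodings described next. All function names below are placeholders, not the paper's implementation.

```python
import numpy as np

def gp_generation(population, X, y, fit_lasso, crossover, mutate, rng):
    """One hypothetical GP generation over candidate basis-function sets.

    population : list of (phi_set, psi_set) candidates (assumed length >= 4)
    fit_lasso  : callable running the constrained Lasso (Algorithm 1),
                 returning (loss, weights) for a candidate
    """
    # Fitness = Lasso objective achieved by each candidate (lower is better).
    scores = [fit_lasso(phi, psi, X, y)[0] for phi, psi in population]
    order = np.argsort(scores)

    # Keep the best half; refill the rest with crossover followed by mutation.
    survivors = [population[i] for i in order[: len(population) // 2]]
    children = []
    while len(survivors) + len(children) < len(population):
        ia, ib = rng.choice(len(survivors), size=2, replace=False)
        children.append(mutate(crossover(survivors[ia], survivors[ib])))
    return survivors + children
```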
Encoding Scheme
Most of the SR methods rely on expression trees in their implementation. That is, each mathematical expression is represented by a tree where nodes (including the root) encode arithmetic operations (e.g. $+$, $-$, $\times$, $\div$) or mathematical functions (e.g. $\sin$, $\cos$, $\exp$, $\log$), and leaves contain the independent variables (i.e. $x_1, \ldots, x_d$) or constants.
Inspired by luo2012parse; chen2017elite, we use matrices instead of trees to represent the basis functions. However, we propose our own encoding scheme that we believe is general enough to handle/recover a wide range of expressions.
We introduce the basis matrices $\mathbf{B}_x$ and $\mathbf{B}_y$ to represent the basis functions $\phi_j$ and $\psi_k$ used with the feature vector $\mathbf{x}$ and the target variable $y$, respectively. The basis matrices $\mathbf{B}_x$ and $\mathbf{B}_y$ are of sizes $m_x \times (r+2)$ and $m_y \times 1$ respectively (i.e. $\mathbf{B}_y$ is a column vector), and take the form

$$\mathbf{B}_x = \begin{bmatrix} t_1 & a_1 & v_{1,1} & \cdots & v_{1,r} \\ \vdots & \vdots & \vdots & & \vdots \\ t_{m_x} & a_{m_x} & v_{m_x,1} & \cdots & v_{m_x,r} \end{bmatrix}, \qquad \mathbf{B}_y = \begin{bmatrix} t'_1 \\ \vdots \\ t'_{m_y} \end{bmatrix} \qquad (14)$$

where the entries $t_i$, $a_i$, $v_{i,\ell}$, and $t'_i$ are all integers. The first column of $\mathbf{B}_x$ and the first (and only) column of $\mathbf{B}_y$ indicate the mathematical function (or transformation) to be applied (on the input feature vector $\mathbf{x}$ and the target variable $y$, respectively). In $\mathbf{B}_x$, the second column specifies the type of argument (see Table 3), and the remaining $r$ columns indicate which independent variables (or features) are involved (i.e. the active operands). The quantity $r$ represents the maximum total multiplicity of all the independent variables included in the argument. Note that $m_x$ and $m_y$ specify the number of transformations to be multiplied together (i.e. each basis function $\phi_j$ and $\psi_k$ will be a product of $m_x$ and $m_y$ transformations, respectively).
The encoding/decoding process happens according to a table of mapping rules that is very straightforward to understand and employ. For instance, consider the mapping rules outlined in Table 3, where $d$ is the dimension of the input feature vector. As we will see in our numerical experiments in Section 4, we will adopt this table for many SR benchmark problems. Other mapping tables are defined according to different benchmark problems (each SR benchmark problem uses a specific set or library of allowable mathematical functions (e.g. $\sin$, $\cos$, $\exp$, $\log$), and hence, we mainly modify the first two rows of the mapping tables); more details about the SR benchmark problem specifications can be found in the appendix. The encoding from the analytical form of a basis function to the basis matrix is straightforward. For example, the basis function used in Table 3 can be generated according to the encoding steps shown there.
Based on the mapping rules in Table 3 and the encoding steps in Table 3, this basis function can be described by a basis matrix as follows:
(15) |
Table 3 (mapping rules): each integer code is mapped to a building block of the basis functions. The row "Transformation" maps the codes 0–5 to the allowable transformations (code 0 corresponds to the constant 1); the row "Argument Type" maps the codes 0–2 to the possible argument types; the row "Variable" maps the codes 0–3 to the independent variables, with code 0 meaning "skip".

Table 3 (encoding steps): the example basis function is constructed row by row in four steps; at each step, the selected transformation, argument type, and variables update the partial expression, with "–" denoting ignored entries and "skip" denoting a skipped variable, followed by a final update that yields the complete basis function.
Remark 3.1.
In Table 3, "–" denotes entries that are ignored during the construction of the basis function. The argument type in Step 1 selects only the first of the listed variables as the argument, and hence the entries corresponding to the remaining variables are ignored. Similarly, the transformation in Step 4 is the constant transformation, which implies that the argument type and the variables are all ignored. These are the only two cases where some entries are ignored during the construction process. The same ignored entries are reflected in the basis matrix in Equation 15. More encoding examples can be found in the appendix.
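To illustrate how a basis matrix can be decoded back into a callable basis function, here is a hypothetical sketch; the integer code tables below (for the transformation, argument-type, and variable columns) are stand-ins for the mapping rules of Table 3, not the exact table used in the paper.

```python
import numpy as np

# Hypothetical code tables (stand-ins for the mapping rules of Table 3).
TRANSFORMS = {0: lambda a: np.ones_like(a), 1: lambda a: a, 2: np.sin,
              3: np.cos, 4: np.exp, 5: np.log}

def decode_row(row, X):
    """Decode one row [transformation, argument type, v_1, ..., v_r] of a basis matrix."""
    t_code, a_code, *v_codes = row
    variables = [X[:, v - 1] for v in v_codes if v != 0]   # code 0 means "skip"
    if t_code == 0 or not variables:                        # constant: argument ignored
        arg = np.ones(X.shape[0])
    elif a_code == 0:                                       # assumed: first selected variable only
        arg = variables[0]
    else:                                                   # assumed: product of the selected variables
        arg = np.prod(np.column_stack(variables), axis=1)
    return TRANSFORMS[int(t_code)](arg)

def decode_basis(B, X):
    """A basis function is the product of the transformations encoded by the rows of B."""
    return np.prod(np.column_stack([decode_row(r, X) for r in B]), axis=1)

# Example: under the assumed code tables, B encodes sin(x1) * exp(x1 * x2).
X = np.random.default_rng(0).uniform(0.1, 1.0, size=(5, 2))
B = np.array([[2, 0, 1, 0],
              [4, 1, 1, 2]])
print(decode_basis(B, X))
```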