
GSR: A Generalized Symbolic Regression Approach

Tony Tohme [email protected]
Massachusetts Institute of Technology
Dehong Liu [email protected]
Mitsubishi Electric Research Laboratories
Kamal Youcef-Toumi [email protected]
Massachusetts Institute of Technology
Abstract

Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method is competitive with strong SR benchmark methods, achieving promising experimental performance on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.

1 Introduction

Symbolic regression (SR) aims to find a mathematical expression that best describes the relationship between the independent variables and the target (or dependent) variable based on a given dataset. By inspecting the resulting expression, we may be able to identify nontrivial relations and/or physical laws which can provide more insight into the system represented by the given dataset. SR has gained tremendous interest and attention from researchers over the years for many reasons. First, many rules and laws in natural sciences (e.g. in physical and dynamical systems (schmidt2009distilling; quade2016prediction)) are accurately represented by simple analytical equations (which can be explicit (brunton2016discovering) or implicit (mangan2016inferring; kaheman2020sindy)). Second, in contrast to neural networks, which involve complex input-output mappings and hence are often treated as black boxes that are difficult to interpret, SR produces concise and interpretable expressions. Finally, symbolic equations may outperform neural networks in out-of-distribution generalization (especially for physical problems) (cranmer2020discovering).

SR does not require a priori specification of a model. Conventional regression methods such as least squares (wild1989nonlinear), likelihood-based (edwards1984likelihood; pawitan2001all; tohme2021improving), and Bayesian regression techniques (lee1997bayesian; leonard2001bayesian; tohme2020bayesian; tohme2020generalized; vanslette2020general) use fixed-form parametric models and optimize for the model parameters only. SR seeks to find both a model structure and its associated parameters simultaneously.

Related Work. The SR problem has been widely studied in the literature (orzechowski2018we; la2021contemporary). SR can be a very challenging problem and is thought to be NP-hard (lu2016using; udrescu2020ai; petersen2021deep; virgolin2022symbolic). It can also be computationally expensive, since the search space is very wide (or complex), containing expressions of arbitrary size and length (de2018greedy); this issue is exacerbated as the dimension of the input feature vector (i.e. the number of independent variables) grows. Several approaches have been suggested over the years. Most of the methods use genetic (or evolutionary) algorithms (koza1992genetic; schmidt2009distilling; back2018evolutionary; virgolin2019linear). Some more recent methods are Bayesian in nature (jin2019bayesian), some are physics-inspired (udrescu2020ai), and others use divide-and-conquer (luo2017divide) and block building algorithms (chen2017fast; chen2018block; chen2018multilevel). Lately, researchers proposed using machine learning algorithms and neural networks to solve the SR problem (martius2016extrapolation; sahoo2018learning; udrescu2020ai2; AhnKLS20; al2020universal; kim2020integration; kommenda2020parameter; operonc++; biggio2021neural; mundhenk2021symbolic; petersen2021deep; valipour2021symbolicgpt; razavi2022neural; zhang2022ps; d2022deep; kamienny2022end; zhang2022deep). Furthermore, some works suggested constraining the search space of functions to a generalized linear space (nelder1972generalized) (e.g. Fast Function eXtraction (mcconaghy2011ffx), Elite Bases Regression (chen2017elite), etc.), which proved to accelerate the convergence of genetic algorithms significantly (at the expense of sometimes losing the generality of the solution (luo2017divide)).

Most SR methods use a tree-based implementation, where analytical functions are represented (or encoded) by expression trees. Some approaches suggested encoding functions as integer strings (o2001grammatical), while others proposed representing them using matrices (luo2012parse; chen2017elite; de2018greedy; de2020interaction). As we will discuss in later sections, our implementation relies on matrices to encode expressions.

Our Contribution. We present Generalized Symbolic Regression (GSR), by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we identify mathematical relationships between the independent variables (or features) and some transformation of the target variable. In other words, we learn the mapping from the feature space to a transformed target space (where the transformation applied to the target variable is also learned during this process). To find the appropriate functions (or transformations) to be applied to the features as well as to the targets, we constrain our search space to a weighted sum of basis functions. In contrast to conventional tree-based genetic programming approaches, we propose a matrix-based encoding scheme to represent the basis functions (and hence the full mathematical expressions). We run a series of numerical experiments on the well-known SR benchmark datasets and show that our proposed method is competitive with many strong SR methods. Finally, we introduce SymSet, a new SR benchmark problem set that is more challenging than existing benchmarks.

2 Notation and Problem Formulation

Consider the following regression task. We are given a dataset $\mathcal{D}=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{N}$ consisting of $N$ i.i.d. paired examples, where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ denotes the $i^{\text{th}}$ $d$-dimensional input feature vector and $y_{i}\in\mathbb{R}$ represents the corresponding continuous target variable. The goal of SR is to search the space $\mathcal{S}$ of all possible mathematical expressions defined by a set of given mathematical functions (e.g., $\exp$, $\ln$, $\sin$, $\cos$) and arithmetic operations (e.g., $+$, $-$, $\times$, $\div$) by solving the following optimization problem:

f^{*}=\arg\min_{f\in\mathcal{S}}\,\sum_{i=1}^{N}\big[f(\mathbf{x}_{i})-y_{i}\big]^{2} \qquad (1)

where $f$ is the model function and $f^{*}$ is the optimal model.

3 Generalized Symbolic Regression (GSR)

In this section, we introduce our Generalized Symbolic Regression (GSR) approach. We present its problem formulation, and discuss its solution and implementation.

3.1 Modifying the goal of symbolic regression

As highlighted in Section 2, the goal of SR is to search the function space to find the model that best fits the mapping between the independent variables and the target variable (i.e. the mapping between $\mathbf{x}_{i}$ and $y_{i}$, for all $i$). Since the main objective of SR is to recognize correlations and find non-trivial interpretable models (rather than making direct predictions), we modify the goal of SR: we instead search the function space to find the model that best describes the mapping between the independent variables and a transformation of the target variable (i.e. the mapping between $\mathbf{x}_{i}$ and some transformation or function of $y_{i}$, for all $i$). Formally, we propose modifying the goal of SR to search for appropriate (model) functions from a space $\mathcal{S}$ of all possible mathematical expressions defined by a set of given mathematical functions (e.g., $\exp$, $\ln$, $\sin$, $\cos$) and arithmetic operations (e.g., $+$, $-$, $\times$, $\div$), which can be described by the following optimization problem:

f^{*},g^{*}=\arg\min_{f,\,g\,\in\,\mathcal{S}}\,\sum_{i=1}^{N}\big[f(\mathbf{x}_{i})-g(y_{i})\big]^{2} \qquad (2)

where $f^{*}$ and $g^{*}$ are the optimal analytical functions. In other words, instead of searching for mathematical expressions of the form $y=f(\mathbf{x})$ as is usually done in the SR literature, the proposed GSR approach attempts to find expressions of the form $g(y)=f(\mathbf{x})$. We illustrate this concept in Table 1.

Table 1: GSR finds analytical expressions of the form $g(y)=f(\mathbf{x})$ instead of $y=f(\mathbf{x})$.
Ground Truth Expression | Learned Expression
$y=\sqrt{x+5}$ | $y^{2}=x+5$
$y=1/(3x_{1}+x_{2}^{3})$ | $y^{-1}=3x_{1}+x_{2}^{3}$
$y=(2x_{1}+x_{2})^{-\frac{2}{3}}$ | $\ln(y)=-\frac{2}{3}\ln(2x_{1}+x_{2})$
$y=\ln(x_{1}^{3}+4x_{1}x_{2})$ | $e^{y}=x_{1}^{3}+4x_{1}x_{2}$
$y=e^{x_{1}^{3}+2x_{2}+\cos(x_{3})}$ | $\ln(y)=x_{1}^{3}+2x_{2}+\cos(x_{3})$

Although the main goal of GSR is to find expressions of the form $g(y)=f(\mathbf{x})$, we may encounter situations where it is best to simply learn expressions of the form $y=f(\mathbf{x})$ (i.e. $g(y)=y$). For instance, consider the ground truth expression $y=\sin(x_{1})+2x_{2}$. In this case, we expect to learn the expression exactly as is (i.e. $g(y)=y$ and $f(\mathbf{x})=\sin(x_{1})+2x_{2}$) as long as the right basis functions (i.e. $\sin(x)$ and $x$ in this case) are within the search space, as we will see in the next sections.

Making predictions. Given a new input feature vector $\mathbf{x}_{*}$, predicting $y_{*}$ with GSR is simply a matter of solving the equation $g(y)=f(\mathbf{x}_{*})$ for $y$, or equivalently, $g(y)-f(\mathbf{x}_{*})=0$. Note that $f(\mathbf{x}_{*})$ is a known quantity and $y$ is the only unknown. If $g(\cdot)$ is an invertible function, then $y_{*}$ can be easily found using $y_{*}=g^{-1}\big(f(\mathbf{x}_{*})\big)$. If $g(\cdot)$ is not invertible, then $y_{*}$ is a root of the function $h(y)=g(y)-f(\mathbf{x}_{*})$, which can be found numerically (e.g., using Newton's method). In either case, we might end up with several candidate solutions for $y_{*}$ (a function that is not one-to-one can map more than one value of $y$ to the same value of $f(\mathbf{x}_{*})$). In this case, we choose $y_{*}$ to be the solution that belongs to the range of $y$, which can be determined from the training dataset.
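To make this prediction step concrete, the following is a minimal Python sketch (not part of the original method description) that solves $g(y)=f(\mathbf{x}_{*})$ by bracketing sign changes of $h(y)=g(y)-f(\mathbf{x}_{*})$ over the range of $y$ observed in training; the helper name predict_y, the use of scipy.optimize.brentq, and the grid size are illustrative choices.

```python
import numpy as np
from scipy import optimize

def predict_y(g, f_x_star, y_range, num_points=50):
    """Solve g(y) = f(x_*) for y by locating roots of h(y) = g(y) - f(x_*)."""
    def h(y):
        return g(y) - f_x_star

    y_min, y_max = y_range
    grid = np.linspace(y_min, y_max, num_points)
    roots = []
    # Bracket each sign change of h over the observed range of y, then refine it.
    for a, b in zip(grid[:-1], grid[1:]):
        if h(a) * h(b) < 0:
            roots.append(optimize.brentq(h, a, b))
    # All bracketed roots lie inside the training range of y by construction;
    # return the first one found, or None if no root lies in that range.
    return roots[0] if roots else None

# Example: learned relation ln(y) = f(x), with f(x_*) = 1.2 for a new input x_*.
print(predict_y(np.log, 1.2, y_range=(0.1, 10.0)))  # approximately exp(1.2) = 3.32
```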

3.2 A new problem formulation for symbolic regression

Now that we have presented the goal of our proposed GSR approach (summarized by Equation 2), we need to constrain the search space of functions $\mathcal{S}$ to reduce the computational challenges and accelerate the convergence of our algorithm. Inspired by mcconaghy2011ffx; chen2017elite as well as classical system identification methods (brunton2016discovering), we confine $\mathcal{S}$ to generalized linear models, i.e. to functions that can be expressed as a linear combination (or weighted sum) of basis functions (which can be linear or nonlinear). In mathematical terms, for a given input feature vector $\mathbf{x}_{i}$ and a corresponding target variable $y_{i}$, the search space $\mathcal{S}$ is constrained to model functions of the form:

fโ€‹(๐ฑi)=โˆ‘j=1Mฯ•ฮฑjโ€‹ฯ•jโ€‹(๐ฑi),gโ€‹(yi)=โˆ‘j=1Mฯˆฮฒjโ€‹ฯˆjโ€‹(yi)\displaystyle f(\mathbf{x}_{i})=\sum_{j=1}^{M_{\phi}}\alpha_{j}\phi_{j}(\mathbf{x}_{i}),\qquad g(y_{i})=\sum_{j=1}^{M_{\psi}}\beta_{j}\psi_{j}(y_{i}) (3)

where $\phi_{j}(\cdot)$ and $\psi_{j}(\cdot)$ are the basis functions applied to the feature vector $\mathbf{x}_{i}$ and the target variable $y_{i}$, respectively, and $M_{\phi}$ and $M_{\psi}$ denote the corresponding numbers of basis functions. In matrix form, the minimization problem described in Equation 2 is equivalent to finding the vectors of coefficients $\bm{\alpha}=[\alpha_{1}\,\cdots\,\alpha_{M_{\phi}}]^{T}$ and $\bm{\beta}=[\beta_{1}\,\cdots\,\beta_{M_{\psi}}]^{T}$ such that:

\bm{\alpha}^{*},\bm{\beta}^{*}=\arg\min_{\bm{\alpha},\bm{\beta}}\,||\mathbf{X}\bm{\alpha}-\mathbf{Y}\bm{\beta}||^{2} \qquad (4)

where

๐—=[ฯ•1โ€‹(๐ฑ1)ฯ•2โ€‹(๐ฑ1)โ‹ฏฯ•Mฯ•โ€‹(๐ฑ1)โ‹ฎโ‹ฎโ‹ฏโ‹ฎฯ•1โ€‹(๐ฑN)ฯ•2โ€‹(๐ฑN)โ‹ฏฯ•Mฯ•โ€‹(๐ฑN)],๐˜=[ฯˆ1โ€‹(y1)ฯˆ2โ€‹(y1)โ‹ฏฯˆMฯˆโ€‹(y1)โ‹ฎโ‹ฎโ‹ฏโ‹ฎฯˆ1โ€‹(yN)ฯˆ2โ€‹(yN)โ‹ฏฯˆMฯˆโ€‹(yN)].\displaystyle\mathbf{X}=\begin{bmatrix}\phi_{1}(\mathbf{x}_{1})&\phi_{2}(\mathbf{x}_{1})&\cdots&\phi_{M_{\phi}}(\mathbf{x}_{1})\\ \vdots&\vdots&\cdots&\vdots\\ \phi_{1}(\mathbf{x}_{N})&\phi_{2}(\mathbf{x}_{N})&\cdots&\phi_{M_{\phi}}(\mathbf{x}_{N})\\ \end{bmatrix},\quad\mathbf{Y}=\begin{bmatrix}\psi_{1}(y_{1})&\psi_{2}(y_{1})&\cdots&\psi_{M_{\psi}}(y_{1})\\ \vdots&\vdots&\cdots&\vdots\\ \psi_{1}(y_{N})&\psi_{2}(y_{N})&\cdots&\psi_{M_{\psi}}(y_{N})\\ \end{bmatrix}. (5)

Note that if we examine the minimization problem as expressed in Equation 4, we can trivially minimize $||\mathbf{X}\bm{\alpha}-\mathbf{Y}\bm{\beta}||^{2}$ by simply setting $\bm{\alpha}^{*}=0$ and $\bm{\beta}^{*}=0$, which does not lead to a meaningful solution to our GSR problem. In addition, to avoid reaching overly complex mathematical expressions for $f(\cdot)$ and $g(\cdot)$, we are interested in finding sparse solutions for the weight vectors $\bm{\alpha}^{*}$ and $\bm{\beta}^{*}$, consisting mainly of zeros, which results in simple analytical functions containing only the surviving basis functions (i.e. those whose corresponding weights are nonzero). This is closely related to sparse identification of nonlinear dynamics (SINDy) methods (brunton2016discovering). To this end, we apply $L_{1}$ regularization, also known as Lasso regression (tibshirani1996regression), by adding a penalty on the $L_{1}$ norm of the weight vector (i.e. the sum of its absolute values), which leads to sparse solutions with few nonzero coefficients. In terms of our GSR method, Lasso regression automatically performs basis function selection from the set of basis functions under consideration.

Putting the pieces together, we reformulate the minimization problem in Equation 4 as a constrained Lasso regression optimization problem defined as

\bm{w}^{*}=\arg\min_{\bm{w}}\;||\bm{A}\bm{w}||_{2}^{2}+\lambda||\bm{w}||_{1} \qquad (6)
\text{s.t.}\;\;||\bm{w}||_{2}=1

where $\lambda>0$ is the regularization parameter, and

\bm{A}=\begin{bmatrix}\bm{X}&-\bm{Y}\end{bmatrix},\qquad\bm{w}=\begin{bmatrix}\bm{\alpha}\\ \bm{\beta}\end{bmatrix}=[\alpha_{1}\,\cdots\,\alpha_{M_{\phi}}\,\,\beta_{1}\,\cdots\,\beta_{M_{\psi}}]^{T}. \qquad (7)
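As an illustration of how the quantities in Equations 5 and 7 might be assembled in practice, here is a minimal numpy sketch; the helper name build_design_matrices and the example basis functions are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def build_design_matrices(X_data, y_data, phi_basis, psi_basis):
    """Assemble X (N x M_phi), Y (N x M_psi), and A = [X, -Y] as in Eqs. 5 and 7.

    X_data    : (N, d) array of input feature vectors
    y_data    : (N,) array of target values
    phi_basis : list of callables, each mapping a feature vector to a scalar
    psi_basis : list of callables, each mapping a scalar target to a scalar
    """
    X = np.column_stack([[phi(x) for x in X_data] for phi in phi_basis])
    Y = np.column_stack([[psi(y) for y in y_data] for psi in psi_basis])
    A = np.hstack([X, -Y])
    return X, Y, A

# Example with d = 2: phi_1(x) = x1, phi_2(x) = cos(x1 * x2), psi_1(y) = y, psi_2(y) = ln(y).
rng = np.random.default_rng(0)
X_data = rng.uniform(0.5, 2.0, size=(100, 2))
y_data = rng.uniform(0.5, 2.0, size=100)
phi_basis = [lambda x: x[0], lambda x: np.cos(x[0] * x[1])]
psi_basis = [lambda y: y, lambda y: np.log(y)]
X, Y, A = build_design_matrices(X_data, y_data, phi_basis, psi_basis)
print(A.shape)  # (100, 4): M_phi + M_psi columns
```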

3.3 Solving the GSR problem

To solve the GSR problem, we first present our approach for solving the constrained Lasso problem in Equation 6, assuming some particular sets of basis functions are given. We then outline our genetic programming (GP) procedure for finding the appropriate (or optimal) sets of these basis functions, before discussing our matrix-based encoding scheme (to represent the basis functions) that we will use in our GP algorithm.

3.3.1 Solving the Lasso optimization problem given particular sets of basis functions

We assume for now that, in addition to the dataset $\mathcal{D}=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{N}$, we are also given the sets of basis functions $\{\phi_{j}(\mathbf{x}_{i})\}_{j=1}^{M_{\phi}}$ and $\{\psi_{j}(y_{i})\}_{j=1}^{M_{\psi}}$ used with the input feature vector $\mathbf{x}_{i}$ and its corresponding target variable $y_{i}$, respectively, for all $i$. In other words, we assume for now that the matrix $\bm{A}$ in Equation 7 is formed from particular sets of basis functions, and we are mainly interested in solving the constrained optimization problem in Equation 6. Applying the alternating direction method of multipliers (ADMM) (boyd2011distributed), the optimization problem in Equation 6 can be written as

\bm{w}^{*}=\arg\min_{\bm{w}}\;||\bm{A}\bm{w}||_{2}^{2}+\lambda||\bm{z}||_{1} \qquad (8)
\text{s.t.}\;\;||\bm{w}||_{2}=1,\quad\bm{w}-\bm{z}=0

where $\lambda>0$ is the regularization parameter. The scaled form of ADMM (see boyd2011distributed for details) for this problem is

\bm{w}_{k}\leftarrow\arg\min_{||\bm{w}||_{2}=1}\;\mathcal{L}_{\rho}(\bm{w},\bm{z}_{k-1},\bm{u}_{k-1}) \qquad (9)
\bm{z}_{k}\leftarrow S_{\lambda/\rho}\big(\bm{w}_{k}+\bm{u}_{k-1}\big)
\bm{u}_{k}\leftarrow\bm{u}_{k-1}+\bm{w}_{k}-\bm{z}_{k}

where ๐’–\bm{u} is the scaled dual vector, and

\mathcal{L}_{\rho}(\bm{w},\bm{z},\bm{u})=||\bm{A}\bm{w}||_{2}^{2}+\frac{\rho}{2}\,||\bm{w}-\bm{z}+\bm{u}||_{2}^{2} \qquad (10)

where ฯ>0\rho>0 is the penalty parameter and the soft thresholding operator SS is defined as

S_{\kappa}(a)=\begin{cases}a-\kappa & a>\kappa\\ 0 & |a|\leq\kappa\\ a+\kappa & a<-\kappa\end{cases} \qquad (11)
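The soft thresholding operator in Equation 11 has a direct elementwise (vectorized) form; a minimal numpy sketch of it (our own, for illustration) is:

```python
import numpy as np

def soft_threshold(a, kappa):
    """Elementwise soft-thresholding operator S_kappa from Equation 11."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

# S_0.5 applied elementwise: values inside [-0.5, 0.5] are set to zero,
# the rest are shrunk toward zero by 0.5.
print(soft_threshold(np.array([-1.0, -0.2, 0.3, 2.0]), 0.5))  # [-0.5  0.   0.   1.5]
```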
Input: ๐‘จ\bm{A}, ฮป\lambda, ฯ\rho, ๐’˜0\bm{w}_{0}, ๐’›0\bm{z}_{0}, ๐’–0\bm{u}_{0}
Output: ๐’˜\bm{w}
function SolveADMM(๐‘จ\bm{A}, ฮป\lambda, ฯ\rho, ๐’˜0\bm{w}_{0}, ๐’›0\bm{z}_{0}, ๐’–0\bm{u}_{0})
ย ย ย ย ย ย  Initialization: ๐’˜โ†๐’˜0,๐’›โ†๐’›0,๐’–โ†๐’–0\bm{w}\leftarrow\bm{w}_{0},\bm{z}\leftarrow\bm{z}_{0},\bm{u}\leftarrow\bm{u}_{0};
ย ย ย ย ย ย  whileย Not Convergeย do
ย ย ย ย ย ย ย ย ย ย ย ย  ๐’˜โ†(2โ€‹๐‘จTโ€‹๐‘จ+ฯโ€‹I)โˆ’1โ‹…ฯโ€‹(๐’›โˆ’๐’–)\bm{w}\leftarrow\big{(}2\bm{A}^{T}\bm{A}+\rho I\big{)}^{-1}\cdot\rho\left(\bm{z}-\bm{u}\right);
ย ย ย ย ย ย ย ย ย ย ย ย  ๐’˜โ†๐’˜/โ€–๐’˜โ€–2\bm{w}\leftarrow\bm{w}/||\bm{w}||_{2};
ย ย ย ย ย ย ย ย ย ย ย ย  ๐’›โ†Sฮป/ฯโ€‹(๐’˜+๐’–)\bm{z}\hskip 2.85pt\leftarrow S_{\lambda/\rho}\big{(}\bm{w}+\bm{u}\big{)};
ย ย ย ย ย ย ย ย ย ย ย ย  ๐’–โ†๐’–+๐’˜โˆ’๐’›\bm{u}\hskip 2.3pt\leftarrow\bm{u}+\bm{w}-\bm{z};
ย ย ย ย ย ย ย ย ย ย ย ย 
ย ย ย ย ย ย  end while
ย ย ย ย ย ย 
end function
Algorithmย 1 Solving the constrained Lasso optimization problem using ADMM

To find the minimizer $\bm{w}_{k}$ in the first step of the ADMM algorithm above (in Equation 9), we first compute the gradient of the function $\mathcal{L}_{\rho}(\bm{w},\bm{z}_{k-1},\bm{u}_{k-1})$ with respect to $\bm{w}$, set it to zero, and then normalize the resulting vector solution:

0=\nabla_{\bm{w}}\mathcal{L}_{\rho}(\bm{w},\bm{z}_{k-1},\bm{u}_{k-1})\big|_{\bm{w}=\bm{w}_{k}}=2\bm{A}^{T}\bm{A}\bm{w}_{k}+\rho\left(\bm{w}_{k}-\bm{z}_{k-1}+\bm{u}_{k-1}\right) \qquad (12)

It follows that

\bm{w}_{k}=\big(2\bm{A}^{T}\bm{A}+\rho I\big)^{-1}\cdot\rho\left(\bm{z}_{k-1}-\bm{u}_{k-1}\right) \qquad (13)
\bm{w}_{k}\leftarrow\bm{w}_{k}/||\bm{w}_{k}||_{2}

Algorithm 1 outlines the overall process for solving the constrained Lasso optimization problem in Equation 6, for a given matrix $\bm{A}$, regularization parameter $\lambda$, penalty parameter $\rho$, and initial guesses $\bm{w}_{0}$, $\bm{z}_{0}$, $\bm{u}_{0}$.
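For concreteness, a minimal numpy sketch of Algorithm 1 might look as follows; the fixed iteration budget in place of a convergence test, the initialization on the unit sphere, and the function name solve_admm are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def solve_admm(A, lam, rho, num_iters=500):
    """Sketch of Algorithm 1: constrained Lasso via scaled ADMM.

    Minimizes ||A w||_2^2 + lam * ||w||_1  subject to  ||w||_2 = 1.
    A fixed iteration budget stands in for the convergence check.
    """
    n = A.shape[1]
    w = np.ones(n) / np.sqrt(n)        # initial guess on the unit sphere
    z = w.copy()
    u = np.zeros(n)
    # Precompute (2 A^T A + rho I); it is solved against in every iteration.
    M = 2.0 * A.T @ A + rho * np.eye(n)
    for _ in range(num_iters):
        w = np.linalg.solve(M, rho * (z - u))        # unconstrained w-update (Eq. 13)
        w = w / max(np.linalg.norm(w), 1e-12)        # project back onto ||w||_2 = 1
        v = w + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft threshold (Eq. 11)
        u = u + w - z
    return w

# Usage (A built as in Eq. 7); small entries of w can then be pruned so that only
# the surviving basis functions remain in the learned expression.
# w_star = solve_admm(A, lam=0.1, rho=1.0)
```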

3.3.2 Finding the appropriate sets of basis functions using genetic programming

Now that we have presented Algorithm 1, which solves the constrained Lasso optimization problem in Equation 6 for particular sets of basis functions, we go through our procedure for finding the optimal sets of basis functions (and hence, the optimal analytical functions $f(\cdot)$ and $g(\cdot)$).

Encoding Scheme
Most SR methods rely on expression trees in their implementation. That is, each mathematical expression is represented by a tree where nodes (including the root) encode arithmetic operations (e.g. $+$, $-$, $\times$, $\div$) or mathematical functions (e.g. $\cos$, $\sin$, $\exp$, $\ln$), and leaves contain the independent variables (i.e. $x_{1},\ldots,x_{d}$) or constants. Inspired by luo2012parse; chen2017elite, we use matrices instead of trees to represent the basis functions. However, we propose our own encoding scheme, which we believe is general enough to handle (and recover) a wide range of expressions.

We introduce the basis matrices $\bm{B}^{\phi}$ and $\bm{B}^{\psi}$ to represent the basis functions $\phi(\cdot)$ and $\psi(\cdot)$ used with the feature vector $\mathbf{x}$ and the target variable $y$, respectively. The basis matrices $\bm{B}^{\phi}$ and $\bm{B}^{\psi}$ are of sizes $n_{\bm{B}^{\phi}}\times m_{\bm{B}^{\phi}}$ and $n_{\bm{B}^{\psi}}\times 1$, respectively (i.e. $\bm{B}^{\psi}$ is a column vector), and take the form

\bm{B}^{\phi}=\begin{bmatrix}b_{1,1}^{\phi}&\ldots&b^{\phi}_{1,m_{\bm{B}^{\phi}}}\\ \vdots&\ddots&\vdots\\ b_{n_{\bm{B}^{\phi}},1}^{\phi}&\ldots&b_{n_{\bm{B}^{\phi}},m_{\bm{B}^{\phi}}}^{\phi}\end{bmatrix},\qquad\bm{B}^{\psi}=\begin{bmatrix}b_{1,1}^{\psi}\\ \vdots\\ b_{n_{\bm{B}^{\psi}},1}^{\psi}\end{bmatrix}, \qquad (14)

where the entries $b^{\phi}_{i,j}$ and $b^{\psi}_{i,1}$ are all integers. The first column $b^{\phi}_{\bullet,1}$ of $\bm{B}^{\phi}$ and the first (and only) column $b^{\psi}_{\bullet,1}$ of $\bm{B}^{\psi}$ indicate the mathematical function (or transformation) to be applied (to the input feature vector $\mathbf{x}$ and the target variable $y$, respectively). In $\bm{B}^{\phi}$, the second column $b^{\phi}_{\bullet,2}$ specifies the type of argument (see Table 2), and the remaining $n_{v}=m_{\bm{B}^{\phi}}-2$ columns $b^{\phi}_{\bullet,3},\cdots,b^{\phi}_{\bullet,m_{\bm{B}^{\phi}}}$ indicate which independent variables (or features) are involved (i.e. the active operands). The quantity $n_{v}$ represents the maximum total multiplicity of all the independent variables included in the argument. Note that $n_{\bm{B}^{\phi}}$ and $n_{\bm{B}^{\psi}}$ specify the number of transformations to be multiplied together (i.e. each basis function $\phi(\cdot)$ and $\psi(\cdot)$ is a product of $n_{\bm{B}^{\phi}}$ and $n_{\bm{B}^{\psi}}$ transformations, respectively).

The encoding/decoding process follows a table of mapping rules that is very straightforward to understand and employ. For instance, consider the mapping rules outlined in Table 2, where $d$ is the dimension of the input feature vector. As we will see in our numerical experiments in Section LABEL:section4, we adopt this table for many SR benchmark problems. Other mapping tables are defined according to different benchmark problems (each SR benchmark problem uses a specific set, or library, of allowable mathematical functions, e.g. $\cos$, $\sin$, $\exp$, $\log$, and hence we mainly modify the first two rows of the mapping tables; more details about the SR benchmark problem specifications can be found in Appendix LABEL:appendixspecifications). The encoding from the analytical form of a basis function to the basis matrix is straightforward. For example, for $d=3$, $n_{\bm{B}^{\phi}}=4$ and $m_{\bm{B}^{\phi}}=5$ (i.e. $n_{v}=3$), the basis function $\phi(\mathbf{x})=x_{2}\cos(x_{1}^{2}x_{2})\ln(x_{1}+x_{3})$ can be generated according to the encoding steps shown in Table 3.

Based on the mapping rules in Table 2 and the encoding steps in Table 3, the basis function $\phi(\mathbf{x})=x_{2}\cos(x_{1}^{2}x_{2})\ln(x_{1}+x_{3})$ can be described by a $4\times 5$ matrix as follows:

\bm{B}^{\phi}=\begin{bmatrix}1&0&2&\bullet&\bullet\\ 2&2&1&1&2\\ 5&1&1&3&0\\ 0&\bullet&\bullet&\bullet&\bullet\end{bmatrix} \qquad (15)
Table 2: Example table of mapping rules for a basis function. The identity operator is denoted by $\bullet^{1}$.
$b_{\bullet,1}$ | 0 | 1 | 2 | 3 | 4 | 5
Transformation ($T$) | $1$ | $\bullet^{1}$ | $\cos$ | $\sin$ | $\exp$ | $\ln$
$b_{\bullet,2}$ | 0 | 1 | 2
Argument Type ($arg$) | $x$ | $\sum$ | $\prod$
$b_{\bullet,3},\cdots,b_{\bullet,m_{\bm{B}}}$ | 0 | 1 | 2 | 3 | $\cdots$ | $d$
Variable ($v$) | skip | $x_{1}$ | $x_{2}$ | $x_{3}$ | $\cdots$ | $x_{d}$
Table 3: Encoding steps corresponding to the basis function $\phi(\mathbf{x})=x_{2}\cos(x_{1}^{2}x_{2})\ln(x_{1}+x_{3})$.
Step | $T$ | $arg$ | $v_{1}$ | $v_{2}$ | $v_{3}$ | Update
1 | $\bullet^{1}$ | $x$ | $x_{2}$ | — | — | $T_{1}(\mathbf{x})=x_{2}$
2 | $\cos$ | $\prod$ | $x_{1}$ | $x_{1}$ | $x_{2}$ | $T_{2}(\mathbf{x})=\cos(x_{1}^{2}x_{2})$
3 | $\ln$ | $\sum$ | $x_{1}$ | $x_{3}$ | skip | $T_{3}(\mathbf{x})=\ln(x_{1}+x_{3})$
4 | $1$ | — | — | — | — | $T_{4}(\mathbf{x})=1$
Final Update: $\phi(\mathbf{x})=T_{1}(\mathbf{x})\cdot T_{2}(\mathbf{x})\cdot T_{3}(\mathbf{x})\cdot T_{4}(\mathbf{x})$
Remark 3.1.

In Table 3, — denotes entries that are ignored during the construction of the basis function. The argument type in Step 1 is $x$, which implies that we only select the first variable (encoded by $b_{\bullet,3}$) out of the $n_{v}$ variables as an argument; hence the entries corresponding to $v_{2}$ and $v_{3}$ are ignored. Similarly, the transformation in Step 4 is $T=1$, which implies that the argument type and the $n_{v}$ variables are all ignored. These are the only two cases where some entries are ignored during the construction process. The same ignored entries are reflected in the matrix $\bm{B}^{\phi}$ using $\bullet$. More encoding examples can be found in Appendix LABEL:moreencodexamples.
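To illustrate how a basis matrix is decoded back into an analytical basis function under the mapping rules of Table 2, here is a minimal Python sketch; the helper decode_basis is our own illustrative code (not the paper's implementation), and the ignored ($\bullet$) entries of Equation 15 are written as 0, which is harmless since a variable code of 0 means skip and a transformation code of 0 makes the whole row the constant 1.

```python
import numpy as np

# Mapping rules of Table 2: first column selects the transformation, second column
# the argument type, remaining columns the variables (0 = skip, k = x_k).
TRANSFORMS = {0: lambda a: 1.0, 1: lambda a: a, 2: np.cos, 3: np.sin, 4: np.exp, 5: np.log}

def decode_basis(B):
    """Decode an integer basis matrix B (Eq. 14) into a callable phi(x).

    Each row encodes one transformation; the basis function is their product.
    x is a 1-D numpy array of features, indexed from x_1 (i.e. x[0]).
    """
    def phi(x):
        value = 1.0
        for row in B:
            T, arg_type, vars_ = row[0], row[1], row[2:]
            if T == 0:                        # constant transformation: T(x) = 1
                continue
            active = [x[v - 1] for v in vars_ if v != 0]
            if arg_type == 0:                 # single variable: first slot only
                argument = x[vars_[0] - 1]
            elif arg_type == 1:               # sum of the active variables
                argument = np.sum(active)
            else:                             # product of the active variables
                argument = np.prod(active)
            value *= TRANSFORMS[T](argument)
        return value
    return phi

# The example of Eq. 15: phi(x) = x2 * cos(x1^2 * x2) * ln(x1 + x3),
# with the ignored (bullet) entries written as 0.
B_phi = np.array([[1, 0, 2, 0, 0],
                  [2, 2, 1, 1, 2],
                  [5, 1, 1, 3, 0],
                  [0, 0, 0, 0, 0]])
phi = decode_basis(B_phi)
x = np.array([1.0, 2.0, 3.0])
print(phi(x), x[1] * np.cos(x[0]**2 * x[1]) * np.log(x[0] + x[2]))  # the two values match
```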