LCEN: A Novel Feature Selection Algorithm for Nonlinear, Interpretable Machine Learning Models

Pedro Seber
Department of Chemical Engineering
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]
&Richard D. Braatz
Department of Chemical Engineering
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]

Abstract

Interpretable architectures can have advantages over black-box architectures, and interpretability is essential for the application of machine learning in critical settings, such as aviation or medicine. However, the simplest, most commonly used interpretable architectures, such as LASSO or elastic net (EN), are limited to linear predictions and have poor feature selection capabilities. In this work, we introduce the LASSO-Clip-EN (LCEN) algorithm for the creation of nonlinear, interpretable machine learning models. LCEN is tested on a wide variety of artificial and empirical datasets, frequently creating more accurate, sparser models than other architectures, including those for building sparse, nonlinear models. LCEN is robust against many issues typically present in datasets and modeling, including noise, multicollinearity, data scarcity, and hyperparameter variance. LCEN is also able to rediscover multiple physical laws from empirical data and, for processes with no known physical laws, LCEN achieves better results than many other dense and sparse methods – including using 10.8-fold fewer features than dense methods and 8.1-fold fewer features than EN on one dataset, and is comparable to or better than ANNs on multiple datasets.

1 Introduction

Statistical models are powerful tools to explain, predict, or describe natural phenomena [1]. They connect independent variables (also called “inputs” or “features”) to dependent variables (also called “outputs” or “labels”) to test causal hypotheses, predict novel outputs from known inputs, or summarizing the data structure [1]. Many model architectures exist, including linear, ensemble-based, and deep learning models. Complex architectures are claimed to have greater capability to model phenomena due to their lower bias, but their intricate and numerous mathematical transformations prevent humans from understanding how an output was predicted by a model, or the relative or absolute importance of the inputs. Moreover, a lack of transparency may prevent the model from being trusted in critical or sensitive applications [2].

As summarized by Ref. [3], there are two main methods to increase interpretability: the use of model-agnostic algorithms, which extract interpretable explanations a posteriori and work for any architecture, or the direct use of interpretable architectures. Interpretable architectures include “decision trees, rules, additive models, attention-based networks, and sparse linear models” [3]. It should be noted that nonlinear models may also be made sparse, and even interpretable, as described later in this section and the rest of this work. As elaborated in the review of Ref. [4], interpretable architectures can have many advantages over black-box or a posteriori explanations, including the ability to assist researchers in refining the model and data, or better highlighting scenarios in which the model fails or lacks robustness. Special attention should be given to sparse models, which identify the most important features, can make the model more robust to variations in the input data, and can significantly improve the model’s interpretability if an interpretable architecture is used [4]. At the same time, even a linear model or decision tree/rules can become unwieldy and uninterpretable if hundreds or thousands of coefficients or rules are present.

Feature selection is the process of selecting the most important features in a model to increase its robustness or interpretability. Many criteria for feature selection exist [5], including significance based on p-values (using a univariate, iterative/stepwise, or global method), using information criteria (such as the AIC [6] and BIC [7]), using penalties (such as in LASSO [8, 9] and elastic net [EN] [10]), criteria based on changes in estimates, and expert knowledge. More broadly, these methods can be classified as filter, wrapper, or integrated methods. While no method is superior for all problems, different works have evaluated and criticized these criteria. For example, stepwise regression is one of the most commonly used methods in many fields thanks to its computational simplicity and ease of understanding [11, 5, 12]. However, stepwise regression is prone to ignoring features with causal effects, including irrelevant features, generating excessively small confidence intervals, and producing incorrect/biased parameters [11, 12]. LASSO is simple and computationally cheap, and has performed well for some problems [13, 14, 15], but can overselect irrelevant variables, can select only as many features as there are samples, and does not handle multicollinear data well [5, 10].¹¹1This last point is somewhat controversial in the literature, see Refs. [13, 16], for example.

Originally, most feature selection methods applied only in linear contexts (or have been applied primarily in linear contexts). For example, the only sparse models referenced in the highly cited review by Ref. [3] are linear. The main sparse architectures (LASSO, EN, and their variants) are linear regressors.²²2Ref. [17] has claimed that, if thousands of samples are available, LASSO can consistently select features even in nonlinear settings, although the coefficients may be distorted. However, this consistency is disputed by Ref. [18]. Later works consider sparse nonlinear models, to address the fact that many natural and industrial processes are nonlinear. Refs. [19, 20, 21] defined sets of features consisting of polynomials (all works), interactions (all works), and/or non-polynomials [19, 21]. ALVEN, the model architecture from Ref. [21], uses an F-test for each feature (including the expanded set of features) to determine whether to keep a feature in the final EN model, a filter approach. However, this F-test has very poor feature selectivity, as nearly all features are selected when traditional values of $\alpha$ ( $0.001\leq\alpha\leq 0.05$ ) are used. Furthermore, the ordering of the features with respect to their p-values does not follow their relevance, as many irrelevant features are among those with the lowest p-values, and relevant features can be among those with the highest p-values (including p $\gg 0.05$ ). The work of Ref. [22] uses two forms of modified LASSO algorithms for nonlinear feature selection, achieving high sparsity with the datasets tested.

More recently, the L₁ regularization has been applied to neural networks for nonlinear feature selection. In its simplest form, group LASSO is applied to zero all the outputs of some neurons (sparsifying the network and eliminating features when zeroing input-layer neurons) [18, 23, 24]. A slightly modified version of this algorithm applies the L₁ penalty only to the input layer and includes a skip-connection between that layer and the output layer [25]. More complex applications of this method include the multi-modal neural networks of Ref. [26], the concrete autoencoders of Ref. [27], and the teacher-student network of Ref. [28]. The first two are notable for being at least partially unsupervised methods, suggesting they can select the most relevant features for a given dataset no matter the task. Some works have also used approaches other than the L₁ norm, such as the L₀ norm [29]. While these models are powerful tools for feature selection, two considerable limitations are present: first, they do not provide any information on how the selected features are contributing to the final prediction, significantly limiting interpretability. A posteriori methods to extract this information have been found unreliable [4], even if useful. Second, these complex model architectures may take “shortcuts” to make apparently accurate predictions [30, 31]. However, these “shortcuts” are not really relevant to the task, preventing proper generalization and human interpretation.

To create nonlinear, interpretable machine learning models with high predictive and descriptive power, we propose the LASSO-Clip-EN (LCEN) algorithm. This algorithm generates an expanded set of nonlinear features (such as in ALVEN) and performs feature selection and model fitting. This feature set expansion, together with the Clip step, provide LCEN with the ability to do nonlinear predictions. The algorithm is tested on artificial and empirical data, successfully rediscovering physical laws from data belonging to multiple different areas of knowledge with errors < 2% on the coefficients, a value within the empirical noise of the datasets. On datasets from processes whose underlying physical laws are not yet known, LCEN attains lower RMSEs than many sparse and dense methods, leads to sparser models than any other architecture tested, and simultaneously runs faster than most alternative architectures.

2 Methods

2.1 Datasets

Multiple datasets are used to test the performance of the LCEN algorithm (summarized in Table A1). These datasets can be divided into three categories: artificial data [“Linear, 5-variable polynomial”, “Multicollinear data”, “Relativistic energy”, and “4th-degree, univariate polynomial”], empirical data from processes with known physical laws [“CARMENES star data” and “Kepler’s 3rd Law”], and empirical data from processes with no known physical laws [“Diesel Freezing Point”, “Abalone”, “Concrete Compressive Strength”, “Boston housing”, and “GEFCom 2014”]. The artificial data generated by us are used for an initial assessment of the LCEN algorithm and to investigate how properties of the data, such as noise or data range, affect its feature selection capabilities. Empirical data from processes with known physical laws are used to verify whether the LCEN algorithm can rediscover known physical laws using data with real properties. Empirical data from processes with no known physical laws are used to compare the performance of the LCEN algorithm against other linear and nonlinear models, including deep learning models. Further description of these datasets and how the artificial datasets are generated is available in Section A2.

All models tested in this work had their hyperparameters selected by 5-fold cross-validation. The separation between training and testing sets varied depending on the dataset. None of the artificial datasets or datasets containing empirical data from processes with known physical laws have a separate test set, as they are used to investigate the capability of the LCEN algorithm to select correct features (which occurs based on the training set). For the “Diesel freezing point” dataset, 30% of the dataset was randomly separated to form the test set. For the “Abalone” dataset, the last 1,044 entries (25%) were used as the test set as per Refs. [32, 33]. For the “Concrete Compressive Strength” dataset, 25% of the dataset was randomly separated to form the test set as per Ref. [34]. For the “Boston housing” dataset, 20% of the dataset was randomly separated to form the test set. For the “GEFCom 2014” dataset, the data from task 1 were used as the training set and all data from tasks 2–15 were used as the test set.

2.2 Algorithm

The LCEN algorithm has five hyperparameters: alpha, which determines the regularization strength (as in the LASSO, EN, and similar algorithms); l1_ratio, which determines how much of the regularization of the EN step depends on the 1-norm as opposed to the 2-norm (as in the EN algorithm); degree, which determines the maximum degree for the basis expansion of the data (Table A2); lag, which determines the maximum number of previous time steps from which features are included (relevant only for dynamic models); and cutoff, which determines the minimum value a scaled parameter needs to have to not be eliminated during the clip step. Details on the cross-validated hyperparameter values for all models are in Section A3.

The LCEN algorithm (formally typeset as Algorithm 1) begins with the LASSO step, setting the l1_ratio to 1. Cross-validation on the training set is employed among all combinations of alpha, degree, and lag values. The values of degree and lag corresponding to the model with the lowest validation MSE are recorded, and parameters are obtained for this LASSO model. The next step in the LCEN algorithm is the clip step, in which parameters smaller than the cutoff are set to 0. Finally, the EN step involves cross-validation on the training set among all combinations of alpha and l1_ratio, using the values of degree and lag obtained in the LASSO step. A final, nonlinear, and interpretable model, whose coefficients are subjected to the clip step again, is returned.

3 Results

3.1 Artificial data highlight LCEN’s robustness to noise, multicollinearity, and hyperparameter variance

The first dataset used to validate the LCEN algorithm is a simple linear polynomial with five independent variables (“Linear, 5-variable polynomial”). This polynomial is of the form $y=-2.8X_{0}-2.7X_{1}-5.3X_{2}+4.3X_{3}+9.0X_{4}+\epsilon$ . Datasets were created with increasing noise levels to determine how LCEN performs. For all noise levels, LCEN returned the correct polynomial among nonlinear and non-polynomial features with degree $\leq 3$ (Table 1). Furthermore, the coefficients did not display any appreciable change despite the increasing noise. This analysis highlights how LCEN is robust to noise in a scenario where all features are independent, even if the hyperparameters allow for extraneous features to potentially be selected by the algorithm.

Table 1: Predicted coefficients and RMSEs to the true coefficients for the “Linear, 5-variable polynomial” dataset at different noise levels. The coefficients are ordered such that the

n

th value corresponds to the coefficient of

X_{n}

Noise Coefficients RMSE 0% -2.80, -2.70, -5.30, 4.30, 9.00 3.1 $\times 10^{-7}$ 59.6% -2.82, -2.70, -5.27, 4.30, 9.01 1.5 $\times 10^{-2}$ 119% -2.83, -2.70, -5.24, 4.30, 9.01 3.0 $\times 10^{-2}$ 238% -2.86, -2.70, -5.19, 4.29, 9.03 5.9 $\times 10^{-2}$

The first step of the LCEN algorithm uses LASSO, which has been claimed to underperform with multicollinear data [5, 10]. Therefore, tests using multicollinear data are done next. The goal is to verify whether LCEN can successfully determine the presence of two different but correlated variables, as LASSO prefers to select only one variable in this scenario [10]. Noise $\epsilon_{1}$ , at different levels, was added to the $X_{0}$ variable to create a correlated variable $X_{1}$ . A second noise $\epsilon_{2}$ , also at different levels, was added to the final $y$ data. When $\epsilon_{1}=0$ , both variables are equal and separation is not possible. However, at other $\epsilon_{1}$ values, the LCEN algorithm is very successful at identifying that two relevant variables exist and assigning correct coefficients to them (Figure A1). Specifically, when the noise level $\epsilon_{1}$ associated with the $X$ data (which indicates how different the $X$ variables are, as highlighted by the variance inflation factors [VIFs] in Fig. A1) is greater than the noise level $\epsilon_{2}$ associated with the $y$ data, LCEN can separate both variables with coefficient errors $\leq 5\%$ . When both noise levels are similar, LCEN can separate both variables with coefficient errors between 5% and 10%. The $X$ data used in this experiment has very high multicollinearity (as shown by the VIFs); real data will typically have lower VIFs and thus be easier to separate using LCEN.

Next, a more complex equation is used to further validate LCEN. The “Relativistic energy” data contain mass and velocity values used to calculate $E^{2}=c^{4}m^{2}+c^{2}m^{2}v^{2}$ . As before, datasets with increasing noise levels are created. The degree hyperparameter is allowed to vary between 1 and 6 in this experiment. LCEN selected only relevant features for all noise levels tested ( $\leq 20\%$ ), and the coefficients were typically equal to the ground truth (Table 2). The only major divergence happened at a noise level of 20%, as the coefficient for $m^{2}$ had a 25% relative error. This error led to our hypothesizing that it is challenging to distinguish among the features involving $m$ (such as $m$ , $m^{2}$ , and $m^{3}$ ) due to the low range of the data. Thus, another dataset with the same properties but a larger range of values for $m$ is created. LCEN performed better on this dataset, selecting only relevant features for all noise levels tested ( $\leq 30\%$ ) and having much lower errors in the estimated coefficients (Table 2). These experiments further highlight the robustness of LCEN and show how the range of the data can affect predictions. To clarify our design choices and the relevance of each individual part of the LCEN architecture, ablation tests are performed with this dataset (and other datasets) in Section A4 of the Appendix.

Table 2: Coefficient values and corresponding relative error to the ground truth for the “Relativistic energy” dataset at different noise levels. The first coefficient is for

m^{2}

and should be

c^{4}=8.078

\times

10^{33}

m⁴/s⁴; the second coefficient is for

m^{2}v^{2}

and should be

c^{2}=8.988

\times

10^{16}

m²/s². The left table is for the dataset with

1\leq m<10

, and the right table is for the dataset with

1\leq m<100

Noise Coefficients Error (%) 0% 8.077 $\times 10^{33}$ , 8.987 $\times 10^{16}$ 0.013, 0.009 5% 8.081 $\times 10^{33}$ , 8.969 $\times 10^{16}$ 0.043, 0.206 10% 8.085 $\times 10^{33}$ , 8.951 $\times 10^{16}$ 0.097, 0.410 15% 8.089 $\times 10^{33}$ , 8.935 $\times 10^{16}$ 0.139, 0.580 20% 6.027 $\times 10^{33}$ , 8.912 $\times 10^{16}$ 25.39, 0.844 Noise Coefficients Error (%) 0% 8.078 $\times 10^{33}$ , 8.987 $\times 10^{16}$ 0.001, 0.006 5% 8.078 $\times 10^{33}$ , 8.986 $\times 10^{16}$ 0.005, 0.022 10% 8.078 $\times 10^{33}$ , 8.984 $\times 10^{16}$ 0.009, 0.038 15% 8.079 $\times 10^{33}$ , 8.983 $\times 10^{16}$ 0.013, 0.054 20% 8.079 $\times 10^{33}$ , 8.981 $\times 10^{16}$ 0.017, 0.070 30% 8.080 $\times 10^{33}$ , 8.978 $\times 10^{16}$ 0.025, 0.103

Finally, LCEN is compared with the feature selection algorithm in ALVEN [21], which uses the same basis function expansion, but uses f-tests for feature selection. The “4th-degree, univariate polynomial” dataset is created as per Ref. [35], such that $y=X+0.5X^{2}+0.1X^{3}+0.05X^{4}+\epsilon$ , 30 $X$ points are available for training, and 1,000 $X$ points are available for testing. These conditions simulate the scarcity of data potentially present in real datasets while ensuring test errors can be predicted with high confidence. Ref. [35] created four types of ALVEN models for this prediction: one that always uses degree = 4 (“unbiased model”), one that always uses degree = 2 (“biased model”), one that selects a degree between 1 and 10 based on cross-validation (“cv”), and one that selects a degree equal to 2 or 4 based on cross-validation (“cv limited order”). Ref. [35] noted that the degree 4 “unbiased model” was the best at low noise levels, but its error quickly increases, leading to the degree 2 “biased model” becoming the best for noise levels > 75 (Figure 3 of Ref. [35]; reproduced with permission here as the left subfigure of Figure 1). The model with degree equal to 2 or 4 “cv limited order” was typically very close in performance to the best model at all noise levels, whereas the model with a degree between 1 and 10 “cv” had lower performance. Ref. [35] explains these observations with the bias-variance tradeoff: at low noise levels, models should follow the ground truth as closely as possible; thus, the degree 4 “unbiased model” was the best. However, at sufficiently high noise levels, it becomes impossible to obtain enough signal to compensate for the additional degrees of freedom (variance) in a 4th degree model; thus, the degree 2 “biased model” becomes the best. The degree between 1 and 10 “cv” model had lower performance due to its greater hyperparameter variance, and the degree equal to 2 or 4 “cv limited order” model struck a balance between the “unbiased model” and the “biased model”.

On this same dataset and using the same four types of models, LCEN attained median errors that are typically over 60% smaller than for ALVEN (Figure 1). Similarly to the models generated using ALVEN, the model with a degree between 1 and 10 “cv” had the lowest performance and the model with degree equal to 2 or 4 “cv limited order” had a performance between the degree 4 “unbiased model” and the degree 2 “biased model”. However, the degree 4 “unbiased model” was always the best model, no matter the noise level used. We attribute this considerable reduction in median test MSEs and the superiority of the degree 4 “unbiased model” created by LCEN to the improved feature selection algorithm, which is able to better resist variance due to noise and a large number of hyperparameters. This is corroborated by how the model with a degree between 1 and 10 “cv” tended to select degree = 4 at lower noise levels and degree = 2 at higher noise levels (Figure A3), showing how LCEN can automatically follow the bias-variance tradeoff hypothesis.

Refer to caption — Figure 1: Test set median MSE for the “4th-degree, univariate polynomial” dataset. ALVEN results (left, reproduced from Ref. [35] with permission) show that the error is monotonically increasing with noise and that the degree 4 “unbiased model” is the best at low noise levels, but is displaced by the degree 2 “biased model” at higher noise levels. On the other hand, LCEN results (right) show that the median errors converge at higher noises. Furthermore, the LCEN median errors are typically over 60% smaller than the ALVEN median errors, and the degree 4 “unbiased model” is always the best model no matter the noise. The “noise level” and “Noise variance $\sigma^{2}$ ” terms are equivalent in this figure. Interquartile ranges for the LCEN model’s test MSEs are available in Fig. A2.

3.2 LCEN surpasses many other architectures when making predictions on empirical data

The applicability of an algorithm to real-world problems is judged only by its performance on real data, as data sparsity or real noise may affect the algorithm’s capabilities. Tests done on empirical data generated by processes with known physical laws show that LCEN still displays exceptional feature selection capabilities, consistently selecting only the right features even when high hyperparameter variance is present (Section A6.1 and Table 3).

Table 3: Summary of LCEN results for the empirical datasets from processes with known physical laws.

Dataset Only Correct Features Relative Error of the Predicted Coefficient CARMENES star data Yes 1.74% Kepler’s 3rd Law (1619) Yes 0.46% Kepler’s 3rd Law (Modern) Yes 0.07%

The final experiments to validate LCEN’s performance involve comparisons to other algorithms on real datasets from processes with unknown physical laws. As there is no (computational) way to validate the feature selection by models trained on these datasets, the main focuses of this section are investigating prediction errors and sparsities of different model architectures. The linear methods ordinary least squares (OLS), ridge regression (RR) [36], partial least squares (PLS) [37, 38], LASSO, and elastic net (EN); the nonlinear ensemble methods random forest (RF) [39], gradient-boosted decision trees (GBDT) [40], adaptive boosting (AdaB) [41]; and the nonlinear methods support vector machine with radial-basis functions (SVM) [42], multilayer perceptron (MLP), MLP with group LASSO (MLP-GL₁) [23], and LassoNet [25] were compared with LCEN. To clarify our design choices and the relevance of each individual part of the LCEN architecture, ablation tests are performed with many of the datasets tested here in Section A4 of the Appendix.

The first dataset analyzed is the “Diesel Freezing Point” dataset [43], which is comprised of 395 diesel spectra measured at 401 wavelengths and used to predict the freezing point of these diesels. The dense, nonlinear architectures SVM and MLP had the best prediction performance, with test RMSEs equal to 4.39 and 4.61 ^∘C respectively (Table 4). They were followed by RR, EN, LCEN, LassoNet, LASSO, and MLP-GL₁, which had test RMSEs between 4.83 and 4.92 ^∘C. This set of architectures include both linear and nonlinear methods, and both dense and sparse methods. Other model architectures performed worse, with test RMSEs > 5.0^∘C. For comparison, 5.0^∘C is about 7.5% of the range of the test data, which contains diesels with freezing points between $-59.5^{\circ}$ C and $6.6^{\circ}$ C. The sparsest architecture was LCEN, which selected only 36/401 features (9.0%) yet had a prediction error only 10.0% higher than that of the best dense method. LCEN and SVM were the only nonlinear methods that had a runtime faster than 10 seconds, a speed typically reserved for linear methods. EN and AdaB had runtimes $\approx$ 20 seconds, and all other models had runtimes on the order of hundreds or thousands of seconds. LCEN is the only architecture that combines a low test RMSE, interpretability, and a fast runtime. LASSO, the only other model architecture that also has these properties, has a worse RMSE and sparsity. Finally, the LCEN cutoff hyperparameter was increased from the value that minimizes the validation MSE to create sparser models. These models have much fewer variables, yet their test set RMSEs typically increase by only small values. This illustrates how LCEN can select the most critical features to make models with high sparsity and predictive power, and how these criteria can be prioritized by the end-user.

Table 4: Results of different model architectures for the “Diesel Freezing Point” dataset.

Architecture Test RMSE (^∘C) Features Runtime (s) OLS 11.75 401 0.09 PLS 5.21 401 4.44 RR 4.83 401 4.87 EN 4.83 299 19.6 LASSO 4.90 39 2.06 RF 5.16 390 307 GBDT 5.40 354 2649 AdaB 5.60 300 22.0 SVM 4.39 401 5.33 MLP 4.61 401 121 MLP-GL₁ 4.92 80 569 LassoNet 4.83 401 474 LCEN 4.83 36 6.54 4.91 29 6.27 5.52 13 5.76 7.40 6 5.53

The next dataset used is the “Abalone” dataset [44]. Abalone (Haliotis sp.) are sea snails whose age can be determined by cutting their shells, staining them, and counting the stained shell rings under a microscope. This process is laborious and error-prone. An alternative is to estimate the number of rings based on readily available physical characteristics, such as weight and size. As before, LCEN was compared with other dense and sparse machine learning models, and LCEN models with increased sparsity were also generated (Table A7). In this problem, OLS, PLS, RR, LASSO, and EN all converged to the OLS solution (that is, no regularization), selecting all 8 linear features and having an RMSE of 2.1 rings. On the other hand, LCEN automatically detected that 2nd degree features would be relevant. The model with the LCEN algorithm also selected all 8 features and had an RMSE of 2.0 rings. Nonlinear models had test RMSEs between 2.0 and 2.3 rings, but they all lack the interpretability of LCEN. By increasing the cutoff hyperparameter, sparser LCEN models may be generated. An LCEN model with only 3 features had an RMSE of 2.1 rings, and another with only 2 features had an RMSE of 2.2 rings. This experiment further illustrates LCEN’s robust feature selection, and how very sparse LCEN models retain significant performance.

LCEN is then tested on the “Concrete Compressive Strength” dataset [34], which contains the composition and age of 1,030 different types of concrete and their compressive strengths. The relationship between these properties is nonlinear, and previous modeling attempts include algebraic expressions and artificial neural networks (specifically, MLPs) [34, 45]. These MLPs were superior to the algebraic models, whereas the algebraic models provide interpretability on how the properties of the concrete affect its compressive strength. LCEN is also considerably better than the previously published algebraic models (Table 5), and its performance is competitive with that of previously published ANNs without sacrificing interpretability. Furthermore, note that no type of validation is mentioned in [34], so the test and validation sets may be the same, making the MLP figures overoptimistic. LCEN has a validation RMSE of 4.66 MPa on this dataset, which is slightly better than that from the MLP of [34]. Some other machine-learning models surpass LCEN in terms of test RMSE, but LCEN has the lowest validation RMSE out of all architectures tested.

Table 5: Results of different model architectures for the “Concrete Compressive Strength” dataset. All machine-learning models selected all 8 features. No type of validation is mentioned in Ref. [34], so the test and validation sets may be the same, making the ANN values overoptimistic.

Architecture Test RMSE (MPa) Algebraic expression [34] 7.79 MLP [34] 4.76 Linear+interactions model [45] 7.43 RF 5.10 GBDT 7.29 AdaB 6.95 SVM 5.94 MLP-GL₁ 5.47 LassoNet 5.53 LCEN 5.73

LCEN is also successful at predicting phenomena caused by human activity instead of physical laws. In the modified “Boston housing” dataset [46], LCEN attains a test RMSE that is only 6% higher than that of a dense MLP (Table A8). Once again, LCEN attains higher performance than many other architectures in this dataset while also being completely interpretable. Finally, the “GEFCom 2014” dataset was used to highlight the ability of LCEN to predict in a complex and dynamic task [47]. Two versions of the “GEFCom 2014” dataset have been published: one that contains only energy consumption levels and another that contains the same energy consumption data and also temperature data from multiple weather stations. This work uses the former. “GEFCom 2014” is part of an energy forecasting competition which was originally won by a LASSO-like model [47]. More recently, deep learning has been applied to this problem [48, 49], and deep learning models have achieved strong 24-hour predictive performance [49]. Despite the strong performance of multiple, complex ANN architectures, LCEN models obtain a 13.1% lower test RMSE on this forecasting task than the state-of-the-art Seq2Seq model from Ref. [49] (Table 6). Unlike the ANNs, LCEN requires only a CPU for training and forecasting, and provides interpretable coefficients. LCEN can also be used for longer forecasts without significant increases in the prediction error, further highlighting the robustness of the algorithm.

Table 6: Results of different model architectures for the “GEFCom 2014” dataset. The deep learning models (TCN to Seq2Seq) and their results come directly from Ref. [49].

Metric TCN RNN LSTM GRU Seq2Seq LCEN Hours Forecast 24 24 24 24 24 24 48 72 120 168 Mean Test RMSE (MW) 17.2 18.0 19.5 19.0 17.1 14.9 18.9 21.0 23.4 24.7 Mean Relative Error (%) 9.8 10.2 11.1 10.8 9.7 8.5 10.7 11.9 13.2 13.9

4 Discussion

This work introduces the LASSO-Clip-EN (LCEN) algorithm for the creation of nonlinear, interpretable machine learning models (Algorithm 1). LCEN is first validated using artificial data (Section 3.1), which provide an initial assessment of the algorithm’s performance under multiple, independently controllable conditions. LCEN displayed high performance for data that are noisy (“Linear, 5-variable polynomial” and “4th-degree, univariate polynomial” datasets; Table 1 and Fig. 1), multicollinear (“Multicollinear data” dataset; Fig. A1), or generated from a process with a complex nonlinear function (“Relativistic energy” and “4th-degree, univariate polynomial” datasets; Table 2 and Fig. 1), which also leads to high hyperparameter and feature variance. For the “4th-degree, univariate polynomial” dataset, LCEN automatically adjusted its predictions to compensate for the bias-variance tradeoff (Figs. 1, A2, and A3). During the experiments with the “Relativistic energy” dataset, the range of the data affected the prediction performance – data with higher ranges led to more accurate predictions even at high noises. We hypothesize that this phenomenon occurs because it is difficult to distinguish between multiple powers of a variable (or, in a more generalized manner, how Taylor approximations are successful) when the data cover low ranges and are noisy.

LCEN was then tested with data from processes with known physical laws (Section A6.1). These data have real properties (such as noise) and may have limitations common to many processes (such as low amounts of data). Nevertheless, LCEN successfully selected the correct features for all datasets used in this work, effectively rediscovering physical laws solely from data (Table 3). The “CARMENES star data” dataset contains hundreds of points described by the Stefan-Boltzmann law. This law uses a 6th-order interaction, but since this would not be known a priori, LCEN was tested using all degree hyperparameters from 1 to 10. Despite the high variance of the process and the many potential features that could be selected, LCEN selected only the correct $R^{2}T^{4}$ feature with a coefficient relative error of only 1.74%, which is well within the 2–3% data error. Next, LCEN was evaluated for the “Kepler’s 3rd Law” datasets, which contain small amounts of data (only 6 or 8 points) yet were enough for Kepler to derive the eponymous law in 1619. LCEN can automatically replicate this discovery despite the slightly inaccurate data from Kepler’s measurements, selecting only the correct $a^{3/2}$ feature. The coefficient relative errors were 0.46% on Kepler’s original data and 0.07% on modern data, extremely small values that highlight the algorithm’s accuracy.

Lastly, experiments using empirical data from processes with unknown laws further validated LCEN’s feature selection and predictive performance. LCEN was compared to multiple different model architectures, including linear models, ensemble methods, and sparse and dense MLPs. On the “Diesel Freezing Point” dataset (Table 4), LCEN had a test RMSE equal to the RMSEs for RR and EN. However, LCEN required 11.1- and 8.3-fold fewer features than RR and EN (respectively) to reach the same accuracy, and LCEN was 3.0-fold faster than EN. LCEN also generated a more accurate and sparser model than LASSO, and LCEN models with increased sparsities typically displayed only small increases in the prediction error. When compared to SVM, the architecture with the lowest test RMSE, LCEN displayed only a 10.0% higher error, despite being a completely interpretable architecture and using only 36/401 features. LCEN provided a combination of accuracy, sparsity, interpretability, and speed that was unmatched by the other machine-learning methods. On the “Abalone” dataset (Table A7), LCEN automatically recognized the importance of 2nd-degree features, which led to its producing the model with the lowest RMSE. All linear methods converged to the same dense model. Nonlinear methods had similar performance, reaching test RMSEs between 2.0 and 2.3 rings, showing LCEN can provide interpretability without sacrificing accuracy. Finally, an LCEN model using only 3 features displayed only a 2.3% increase in RMSE relative to the best LCEN model, and another LCEN model using only 2 features displayed only a 7.1% increase in RMSE relative to the same best LCEN model. As before, LCEN built very sparse yet very accurate models. LCEN was then compared to previously published algebraic expressions and an ANN, and many other machine-learning methods, on the “Concrete Compressive Strength” dataset (Table 5). LCEN displayed a test RMSE at least 23% lower than that of the algebraic models and produces a model that is simpler and interpretable while having an RMSE competitive with that of the MLP of Ref. [34]. In the paper that originated this MLP, no type of validation is mentioned, which could indicate that the MLP results are overoptimistic. The validation performance of LCEN was better than the ANN test performance reported in Ref. [34], indicating LCEN is competitive even with ANNs on this problem if the results reported by Ref. [34] are actually validation results. Some nonlinear machine-learning architectures achieved a lower test RMSE on this task, but LCEN has the lowest validation RMSE out of all architectures tested.

Continuing these experiments, LCEN was also tested against datasets generated by human activity, as opposed to physical phenomena. In a dataset used to predict the prices of houses (“Boston housing”), LCEN achieved strong performance, with a test RMSE only 6% higher than that of an MLP, which had the best test-set performance. This exceptional result further highlights the potential of LCEN, especially in data-deficient scenarios. In the “GEFCom 2014” dynamic dataset, LCEN was able to surpass complex ANN architectures, predicting test-set power loads with a 13.1% lower RMSE than that of the former state-of-the-art model (Table 6). Due to its high robustness, LCEN also could predict for longer time horizons without large increases in the prediction error.

Overall, these experiments have demonstrated the applicability of LCEN to a multitude of scientific and nonscientific problems. LCEN models were robust to defects in the real data, including noise, multicollinearity, or sample scarcity. LCEN models were typically as accurate as or more accurate than many alternative architectures, yet were also much sparser. The LCEN architecture is also trivial to interpret and displays exactly how each input is contributing to the final output. This combination of accuracy and interpretability is essential for the deployment of machine-learning models in performance-critical scenarios, from aviation to medicine. LCEN is free, open-source, and easy to use, allowing even non-specialists in machine learning to benefit from and use it. Moreover, the additional interpretability can assist in data or model refinement efforts and can make the models robust to changes in data or adversarial input. The main limitations of LCEN are that it is not a universal function approximator, as it can model only the functions present in the expansion of dataset features, and that it sometimes is not as accurate as a dense deep learning method. If a GPU and enough time are available for model training, users in scenarios that focus on accuracy above anything else may prefer to use a deep learning architecture.

There are at least two clear future directions for this work. The first involves using the LCEN algorithm in classification tasks, as many important problems in science and engineering involve classification. A comprehensive and robust analysis of the performance of LCEN in classification tasks will follow this paper. A second future direction involves applying the LCEN algorithm to automatically generate physical equations for certain hybrid model architectures (such as physics-constrained or physics-guided ML). The potential and importance of these models has been analyzed in many works, such as in Refs. [50, 51].

References

[1] G. Shmueli, “To explain or to predict?,” Statistical Science, vol. 25, no. 3, pp. 289–310, 2010.
[2] S. R. Hong, J. Hullman, and E. Bertini, “Human factors in model interpretability: Industry practices, challenges, and needs,” Proceedings of the ACM on Human-Computer Interaction, vol. 4, pp. 1–26, May 2020.
[3] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-agnostic interpretability of machine learning,” in ICML Workshop on Human Interpretability in Machine Learning, pp. 91–95, 2016.
[4] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, pp. 206–215, 2019.
[5] G. Heinze, C. Wallisch, and D. Dunkler, “Variable selection – A review and recommendations for the practicing statistician,” Biometrical Journal, vol. 60, no. 3, pp. 431–449, 2018.
[6] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.
[7] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[8] F. Santosa and W. W. Symes, “Linear inversion of band-limited reflection seismograms,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 4, pp. 1307–1330, 1986.
[9] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[10] H. Zou and T. Hastie, “Regularization and Variable Selection Via the Elastic Net,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 67, no. 2, pp. 301–320, 2005.
[11] M. J. Whittingham, P. A. Stephens, R. B. Bradbury, and R. P. Freckleton, “Why do we still use stepwise modelling in ecology and behaviour?,” Journal of Animal Ecology, vol. 75, no. 5, pp. 1182–1189, 2006.
[12] G. Smith, “Step away from stepwise,” Journal of Big Data, vol. 5, p. 32, 2018.
[13] M. Hebiri and J. Lederer, “How correlations influence lasso prediction,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1846–1854, 2013.
[14] S. Tian, Y. Yu, and H. Guo, “Variable selection and corporate bankruptcy forecasts,” Journal of Banking & Finance, vol. 52, pp. 89–100, 2015.
[15] M. Pavlou, G. Ambler, S. Seaman, M. De Iorio, and R. Z. Omar, “Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events,” Statistics in Medicine, vol. 35, no. 7, pp. 1159–1177, 2016.
[16] A. S. Dalalyan, M. Hebiri, and J. Lederer, “On the prediction performance of the Lasso,” Bernoulli, vol. 23, no. 1, pp. 552–581, 2017.
[17] Y. Zhang, W. Guo, and S. Ray, “On the consistency of feature selection with lasso for non-linear targets,” in Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 183–191, 2016.
[18] V. C. Dinh and L. S. Ho, “Consistent feature selection for analytic deep neural networks,” in Advances in Neural Information Processing Systems, vol. 33, pp. 2420–2431, 2020.
[19] T. McConaghy, FFX: Fast, Scalable, Deterministic Symbolic Regression Technology, pp. 235–260. New York: Springer, 2011.
[20] P. T. Brewick, S. F. Masri, B. Carboni, and W. Lacarbonara, “Enabling reduced-order data-driven nonlinear identification and modeling through naïve elastic net regularization,” International Journal of Non-Linear Mechanics, vol. 94, pp. 46–58, 2017.
[21] W. Sun and R. D. Braatz, “ALVEN: Algebraic learning via elastic net for static and dynamic nonlinear model identification,” Computers & Chemical Engineering, vol. 143, p. 107103, 2020.
[22] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, A. Saha, H. Ouyang, D. Yin, H. Mamitsuka, C. Sahinalp, P. Radivojac, F. Menczer, and Y. Chang, “Ultra high-dimensional nonlinear feature selection for big biological data,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 7, pp. 1352–1365, 2018.
[23] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, “Group sparse regularization for deep neural networks,” Neurocomputing, vol. 241, p. 81–89, June 2017.
[24] J. Wang, H. Zhang, J. Wang, Y. Pu, and N. R. Pal, “Feature selection using a neural network with group Lasso regularization and controlled redundancy,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 1110–1123, 2021.
[25] I. Lemhadri, F. Ruan, and R. Tibshirani, “LassoNet: Neural networks with feature sparsity,” in Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, vol. 130, pp. 10–18, 2021.
[26] L. Zhao, Q. Hu, and W. Wang, “Heterogeneous feature selection with multi-modal deep neural networks and sparse group LASSO,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1936–1948, 2015.
[27] M. F. Balın, A. Abid, and J. Zou, “Concrete autoencoders: Differentiable feature selection and reconstruction,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 444–453, 2019.
[28] A. Mirzaei, V. Pourahmadi, M. Soltani, and H. Sheikhzadeh, “Deep feature selection using a teacher-student network,” Neurocomputing, vol. 383, pp. 396–408, 2020.
[29] Y. Yamada, O. Lindenbaum, S. Negahban, and Y. Kluger, “Feature selection using stochastic gates,” in Proceedings of the 37th International Conference on Machine Learning, vol. 119, pp. 10648–10659, 2020.
[30] J. Rosenzweig, J. Sicking, S. Houben, M. Mock, and M. Akila, “Patch shortcuts: Interpretable proxy models efficiently find black-box vulnerabilities,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 56–65, 2021.
[31] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, “Unmasking clever hans predictors and assessing what machines really learn,” Nature Communications, vol. 10, p. 1096, 3 2019.
[32] S. Waugh, Extending and Benchmarking Cascade-Correlation: Extensions to the Cascade-Correlation Architecture and Benchmarking of Feed-forward Supervised Artificial Neural Networks. PhD thesis, University of Tasmania, 1995.
[33] D. Clark, Z. Schreter, and A. Adams, “A quantitative comparison of dystal and backpropagation,” in Proceedings of the Seventh Australian Conference on Neural Networks, pp. 132–137, 1996.
[34] I.-C. Yeh, “Modeling of strength of high-performance concrete using artificial neural networks,” Cement and Concrete Research, vol. 28, no. 12, pp. 1797–1808, 1998.
[35] W. Sun and R. D. Braatz, “Smart process analytics for predictive modeling,” Computers & Chemical Engineering, vol. 144, p. 107134, 2021.
[36] A. N. Tikhonov, “Solution of incorrectly formulated problems and the regularization method.,” Doklady Akademii Nauk SSSR, vol. 4, pp. 1035–1038, 1963.
[37] H. Wold, “11 - Path models with latent variables: The NIPALS approach,” in Quantitative Sociology (H. M. Blalock, A. Aganbegian, F. M. Borodkin, R. Boudon, and V. Capecchi, eds.), pp. 307–357, New York: Academic Press, 1975.
[38] H. Wold, “Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach,” Journal of Applied Probability, vol. 12, no. S1, p. 117–142, 1975.
[39] T. K. Ho, “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
[40] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[41] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[42] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, p. 144–152, 1992.
[43] S. A. Hutzler and S. R. Westbrook, “Estimating chemical and bulk properties of middle distillate fuels from near-infrared spectra,” tech. rep., Defense Technical Information Center, U.S. Army TARDEC, Warren, Michigan, 2000. Report TFLRF No. 348.
[44] W. Nash, T. Sellers, S. Talbot, A. Cawthorn, and W. Ford, “Abalone.” UCI Machine Learning Repository, 1995.
[45] I.-C. Yeh, “Analysis of strength of concrete using design of experiments and neural networks,” Journal of Materials in Civil Engineering, vol. 18, no. 4, pp. 597–604, 2006.
[46] D. J. Harrison and D. L. Rubinfeld, “Hedonic housing prices and the demand for clean air,” J. Environ. Econ. Manage., vol. 5, pp. 81–102, 1978.
[47] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, and R. J. Hyndman, “Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond,” International Journal of Forecasting, vol. 32, no. 3, pp. 896–913, 2016.
[48] H. Wilms, M. Cupelli, and A. Monti, “Combining auto-regression with exogenous variables in sequence-to-sequence recurrent neural networks for short-term load forecasting,” in IEEE 16th International Conference on Industrial Informatics, pp. 673–679, 2018.
[49] A. Gasparin, S. Lukovic, and C. Alippi, “Deep learning for time series forecasting: The electric load case,” CAAI Transactions on Intelligence Technology, vol. 7, no. 1, pp. 1–25, 2022.
[50] G. C. Y. Peng, M. Alber, A. B. Tepole, W. R. Cannon, S. De, S. Dura-Bernal, K. Garikipati, G. Karniadakis, W. W. Lytton, P. Perdikaris, L. Petzold, and E. Kuhl, “Multiscale modeling meets machine learning: What can we learn?,” Archives of Computational Methods in Engineering, vol. 28, pp. 1017–1037, 2021.
[51] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, “Integrating scientific knowledge with machine learning for engineering and environmental systems,” ACM Comput. Surv., vol. 55, p. 66, Nov 2022.
[52] A. Schweitzer, V. M. Passegger, C. Cifuentes, V. J. S. Béjar, M. Cortés-Contreras, J. A. Caballero, C. del Burgo, S. Czesla, M. Kürster, D. Montes, M. R. Zapatero Osorio, I. Ribas, A. Reiners, A. Quirrenbach, P. J. Amado, J. Aceituno, G. Anglada-Escudé, F. F. Bauer, S. Dreizler, S. V. Jeffers, E. W. Guenther, T. Henning, A. Kaminski, M. Lafarga, E. Marfil, J. C. Morales, J. H. M. M. Schmitt, W. Seifert, E. Solano, H. M. Tabernero, and M. Zechmeister, “The CARMENES search for exoplanets around M dwarfs. Different roads to radii and masses of the target stars,” Astron. Astrophys., vol. 625, p. A68, May 2019.
[53] J. Kepler, E. J. Aiton, A. M. Duncan, and J. V. Field, The Harmony of the World, pp. 418, 422. American Philosophical Society, 1997.
[54] Wolfram Alpha LLC, “Wolfram|Alpha,” 2022.
[55] I.-C. Yeh, “Concrete Compressive Strength.” UCI Machine Learning Repository, 2007.

A1 Appendix – Algorithm

Algorithm 1 LASSO-Clip-EN (LCEN)

Input: scaled data; lists of hyperparameters alpha, l1_ratio, degree, lag; hyperparameter cutoff # LASSO step

Temporarily set l1_ratio

=1

for each combination in (alpha, degree, lag) do

Perform cross-validation.

end for

Obtain the best hyperparameters from the above cross-validation.

Obtain parameters with these hyperparameters.

Record the best degree and lag hyperparameters. # Clip step

Remove all parameters with absolute values < cutoff. # EN step

Restore l1_ratio to its original list of values.

for each combination in (alpha, l1_ratio) do

Perform cross-validation.

end for

Obtain the best hyperparameters from the above cross-validation.

Obtain parameters with these hyperparameters. # Clip step II

Remove all parameters with absolute values < cutoff.

return

A2 Appendix – Summary of datasets used in this work

The “Linear, 5-variable polynomial” dataset was created by drawing numbers from a uniform distribution between 1 and 10 and summing, potentially adding noise, such that $y=-2.8X_{0}-2.7X_{1}-5.3X_{2}+4.3X_{3}+9.0X_{4}+\epsilon$ . The “Multicollinear data” dataset was created by drawing numbers from a uniform distribution between 1 and 10 to create one variable $X_{0}$ , which was used together with a small amount of noise to create a correlated variable $X_{1}=X_{0}+\epsilon_{1}$ ; finally, they were summed such that $y=2X_{0}+2X_{1}+\epsilon_{2}$ . The “Relativistic energy” dataset was created by drawing numbers from a uniform distribution between 1 and 10 or 1 and 100 for masses, and 5 $\times$ 10⁷ and 2.5 $\times$ 10⁸ for velocities, which represent the energy of a body as $E^{2}=c^{4}m^{2}+c^{2}m^{2}v^{2}$ . With these velocity numbers, relativistic effects are responsible for 20.4% of the total squared energy on average. The “4th-degree, univariate polynomial” dataset was created by drawing numbers from a normal distribution with mean 0 and variance 5 and transforming them into the polynomial $y=X+0.5X^{2}+0.1X^{3}+0.05X^{4}+\epsilon$ . The real datasets are described in Sections A6.1 and 3.2.

Table A1: Datasets used in this work and their sources. The artificial datasets are used in Section 3.1; the real datasets from processes with known physical laws are used in Section A6.1; and the real datasets from processes with unknown physical laws are used in Section 3.2.

Dataset Name Source Linear, 5-variable polynomial Artificial data generated by us Multicollinear data Artificial data generated by us Relativistic energy Artificial data generated by us 4th-degree, univariate polynomial Artificial data generated by us CARMENES star data [52] [link to dataset] Kepler’s 3rd Law [53] (Original from 1619) [54] (Modern) Diesel Freezing Point [43] [link to dataset] Abalone [44] Concrete Compressive Strength [34] [dataset: [55]] Boston housing (modified by us) [46] [link to dataset] GEFCom 2014 [47] [link to dataset]

A3 Appendix – List of hyperparameters used in this work

All possible permutations of the hyperparameters below were cross-validated.

1.

For the LASSO and Ridge regression models: $\alpha=0$ and 20 log-spaced values between -4.3 and 0 (as per np.logspace(-4.3, 0, 20)).
2.

For the elastic net (EN) models: $\alpha$ as above and L1 ratios equal to [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99].
3.

For the partial least squares (PLS) models: a number of components equal to all integers between 1 and a limit were used. This limit is either the number of features or 80% of the number of samples, whichever is smaller.
4.

For the LCEN models: $\alpha$ and L1 ratios as above. degree values equal to [1, 2, 3] were typically used, except when otherwise indicated (such as in the “Relativistic energy” dataset). lag = 0 was used, except for the “GEFCom 2014” dataset, which used lag = 168. cutoff values between $1$ $\times$ $10^{-3}$ and $5.5$ $\times$ $10^{-1}$ were used; higher values were used only when intentionally creating models with fewer selected features. A cutoff = 0 is used in the ablation tests for the LASSO-EN model (Section A4).
5.

For the random forest (RF) and gradient-boosted decision tree (GBDT) models: [10, 25, 50, 100, 200, 300] trees, maximum tree depth equal to [2, 3, 5, 10, 15, 20, 40], minimum fraction of samples per leaf equal to [0.01, 0.02, 0.05, 0.1], and minimum fraction of samples per tree equal to [0.1, 0.25, 0.333, 0.5, 0.667, 0.75, 1.0]. For the GBDT models, learning rates equal to [0.01, 0.05, 0.1, 0.2] were also used.
6.

For the AdaBoost (AdaB) models: [10, 25, 50, 100, 200, 300] trees/estimators and learning rates equal to [0.01, 0.05, 0.1, 0.2] were used.
7.

For the support vector machine (SVM) models: C values equal to [0.01, 0.1, 1, 10, 50, 100], epsilon values equal to [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3], and gamma values equal to [1/50, 1/10, 1/5, 1/2, 1, 2, 5, 10, 50] divided by the number of features in a dataset were used.
8.

For the multilayer perceptron (MLP), MLP with group LASSO (MLP-GL₁), and LassoNet models: the hidden layer sizes varied for each dataset. Representing an MLP with one hidden layer as [X] and an MLP with two as [X, Y], hidden layer sizes of {[800], [400], [200], [100], [800, 800], [800, 400], [400, 400], [400, 200], [200, 200], [200, 100], [100, 100]} were used with the “Diesel Freezing Point” dataset, {[18], [9], [4], [18, 18], [18, 9], [9, 9], [9, 4], [9, 2], [4, 4]} were used with the “Abalone” dataset, {[16], [8], [4], [48, 16], [48, 8], [40, 24], [40, 16], [40, 8], [32, 16], [32, 8], [24, 16], [24, 8], [16, 16], [16, 8], [8, 8], [8, 4]} were used with the “Concrete Compressive Strength” dataset, and {[26], [13], [6], [78, 26], [65, 39], [65, 26], [65, 13], [52, 39], [52, 26], [52, 13], [39, 39], [39, 26], [39, 13], [26, 26], [26, 13], [13, 13], [13, 6]} were used with the “Boston housing” dataset. Learning rates equal to [0.001, 0.005, 0.01], activation functions in [“relu”, “tanhshrink”], a batch size of 32, 100 epochs, and a cosine scheduler with a minimum learning rate equal to 1/16 of the original learning rate with 10 epochs of warm-up were also used. For the MLP-GL₁ and LassoNet, regularization parameters equal to [1 $\times 10^{-4}$ , 1 $\times 10^{-3}$ , 1 $\times 10^{-2}$ ] were used.

Table A2: Additional features included for each value of the degree hyperparameter for a dataset with three features labeled

X_{0}

X_{1}

, and

X_{2}

. A degree of

n

(any natural number) also includes all features from degrees 1 to

n-1

Degree Sample new features included [for all $k$ ] Total features after expansion 1 intercept, $X_{k},\ln X_{k},({X_{k}})^{1/2},{1}/{X_{k}}$ 13 2 $(X_{k})^{2},X_{0}X_{1},X_{0}X_{2},X_{1}X_{2},(\ln X_{k})^{2},(X_{k})^{3/2},\frac{1}{(X_{k})^{2}},\frac{\ln{X_{k}}}{X_{k}}$ 37 3 $(X_{k})^{3},X_{0}X_{1}X_{2},(X_{0})^{2}X_{1},X_{1}(X_{2})^{2},(\ln X_{k})^{3},(X_{k})^{5/2},\frac{1}{(X_{k})^{3}},\frac{\ln(X_{k})^{2}}{X_{k}},\frac{\ln(X_{k})}{(X_{k})^{2}}$ 75 4 […] 129 5 […] 201

A4 Appendix – Ablation tests

To better clarify the design choices of the LCEN architecture and highlight the relevance of each individual part of the algorithm, ablation tests are performed. Three ablated architectures – LASSO-Clip (LC), EN-Clip (ENC), and LASSO-EN (LEN) – are compared with the original LCEN algorithm. Two variant architectures, LASSO-Clip-LASSO (LCL) and EN-Clip-EN (ENCEN), are also compared. The “Relativistic energy”, “Diesel Freezing Point”, “Abalone”, and “Concrete Compressive Strength” datasets are used in the ablation tests. Tests with the “Relativistic energy” dataset show that models with a Clip step had some degree of success with selecting only relevant features (Table A3). However, the ablated architectures (LC, ENC, and LEN) had much higher prediction errors for the coefficients of the relevant features, even though LC and ENC were able to select only the relevant features. The variant architectures (LCL and ENCEN) had performances closer to that of LCEN, but LCL was slightly worse in terms of error. The models that begin with EN (ENC and ENCEN) are approximately one order of magnitude slower than LCEN on all datasets (Tables A3–A6), whereas LC was approximately 50% faster than LCEN. However, LCEN consistently had the lowest validation RMSE in all datasets, and had the lowest test-set RMSE in all but one dataset. As highlighted by Table A4, LCEN built the sparsest and most accurate models out of all ablated and variant architectures trained with the “Diesel Freezing Point” dataset. Overall, these ablation experiments highlight how LCEN is the optimal architecture to maximize accuracy and selectivity while maintaining a low runtime.

Table A3: Relative error to the ground truth for the “Relativistic energy” dataset with

1\leq m<100

at different noise levels for ablated and variant LCEN model architectures. The first coefficient is for

m^{2}

and the second coefficient is for

m^{2}v^{2}

. Compare with the right-side table of Table 2 and a runtime of 4.79 seconds for LCEN.

Noise LC Error (%) ENC Error (%) LEN Error (%) LCL Error (%) ENCEN Error (%) 0% 36.62, 18.08 41.52, 20.91 43.75, 22.32 0.007, 0.012 0.001, 0.006 5% 37.68, 18.85 41.30, 21.16 43.93, 23.45 0.011, 0.029 0.005, 0.022 10% 18.92, 1.647 44.31, 23.70 1.137, 0.468 0.015, 0.045 0.009, 0.038 15% 39.71, 20.61 44.31, 23.70 46.65, 26.40 0.019, 0.061 0.013, 0.054 20% 39.65, 21.13 39.99, 21.95 45.87, 27.23 0.023, 0.077 0.017, 0.070 30% 22.35, 2.603 22.93, 2.660 7.649, 1.036 0.031, 0.109 0.025, 0.103 Runtime (s) 3.70 37.1 5.40 3.86 38.5

Table A4: Results of different ablated and variant LCEN model architectures for the “Diesel Freezing Point” dataset. Compare with Table 4.

Architecture Test RMSE (^∘C) Features Runtime (s) LC 4.84 37 4.39 5.15 29 4.31 ENC 4.80 263 20.6 5.00 257 20.7 LEN 4.87 39 6.85 LCL 4.93 33 4.74 5.01 28 4.62 ENCEN 4.83 191 31.1 4.90 173 30.1

Table A5: Results of different ablated and variant LCEN model architectures for the “Abalone” dataset. Compare with Table A7 and a runtime of 19.2 seconds for LCEN.

Architecture Test RMSE (rings) Features Runtime (s) LC 2.1 8 11.8 ENC 2.1 8 297 LEN 2.1 8 26.3 LCL 2.0 8 12.9 ENCEN 2.1 8 308

Table A6: Results of different ablated and variant LCEN model architectures for the “Concrete Compressive Strength” dataset. All models selected all 8 features, but a varying number of transforms of these features. Compare with Table 5 and a runtime of 39.7 seconds for LCEN.

Architecture Test RMSE (MPa) Runtime (s) LC 5.53 24.6 ENC 16.5 800 LEN 5.50 44.9 LCL 5.92 26.4 ENCEN 6.12 863

A5 Appendix – Additional results with artificial data

A6 Appendix – Additional results with empirical data

A6.1 Datasets for which physical laws are available

The first test of an empirical dataset from a process with a known physical law uses the “CARMENES star data” dataset from Ref. [52]. This dataset contains information on temperature ( $T$ ), radius ( $R$ ), and luminosity ( $L$ ) of 293 white dwarf stars. These features are linked together by the Stefan-Boltzmann equation, $L=4\pi R^{2}\sigma T^{4}$ , where $\sigma$ is a constant. Normalizing this equation to values from another star (typically, the Sun), conveniently sets the constant terms to 1. This normalization is applied to the “CARMENES star data” dataset. LCEN with degrees from 1 to 10 was applied to this normalized dataset. Despite the very large number of potential features (due to the high degree values used), LCEN correctly selected only the $R^{2}T^{4}$ feature. The coefficient assigned to $R^{2}T^{4}$ is 0.9826, which is well within the 2–3% error on these data (as reported by Ref. [52]). LCEN retained high performance for this real data in a high-hyperparameter variance scenario.

A potential limitation in real datasets is data scarcity. To evaluate the LCEN algorithm in a low-data scenario, the “Kepler’s 3rd Law” datasets are created. The first version uses the original data obtained by Kepler, first published in 1619 and republished in Ref. [53]. From only 6 (slightly inaccurate) measurements, Kepler was able to derive the eponymous Kepler’s 3rd law, which states that the period $T$ of a celestial body is related to the semi-major axis of its orbit $a$ by $T=ka^{3/2}$ . The constant $k$ depends on the masses of the central and orbiting bodies; however, as the mass of the central body is typically much larger, the mass of the orbiting body is ignored. In this and Kepler’s works, $T$ is measured in Earth days, so the constant $k$ is $\sim$ 365.25 when using modern data and $\sim$ 365.15 when using Kepler’s original data. Despite the low number of data points, LCEN correctly selected only the $a^{3/2}$ feature. Moreover, the coefficient assigned to that feature was 366.82, an error of only 0.46% relative to Kepler’s $k=365.15$ .

LCEN is then evaluated using a modern version of the same dataset, which contains 8 points (as Uranus and Neptune were discovered after Kepler’s observations) whose data were measured with greater accuracy. On this modern “Kepler’s 3rd Law” dataset, LCEN again selects only the $a^{3/2}$ feature. The coefficient assigned to the $a^{3/2}$ feature is 365.00, an error of only 0.07% relative to the modern value $k=365.25$ . LCEN did perfect feature selection in these data-scarce scenarios, with parameter estimates minimally affected by experimental noise.

A6.2 Datasets for which no physical law is available

Table A7: Results of different model architectures for the “Abalone” dataset. These results are discussed in Section 3.2.

Architecture Test RMSE (rings) Features OLS = PLS = RR = LASSO = EN 2.1 8 RF 2.1 8 GBDT 2.2 8 AdaB 2.3 8 SVM 2.0 8 MLP 2.0 8 MLP-GL₁ 2.0 8 LassoNet 2.0 8 LCEN 2.0 8 2.1 3 2.2 2

The “Boston housing” dataset contains the median value of owner-occupied houses and many internal and external measurements, such as the per-capita crime rate of the region, the average number of rooms, and the concentration of nitric oxides in the area [46]. We modified this dataset to detransform the B variable into its raw value; samples in which this detransformation led to multiple possible values were discarded. In terms of test RMSE, the linear models (OLS, PLS, RR, EN, LASSO) tended to perform equal to each other and quite poorly on this dataset. RF and SVM performed relatively well, but GBDT and AdaB had the two worst performances among the nonlinear models. A dense MLP was the best model in terms of test RMSE, and the MLP-GL₁ and LassoNet performed similarly but slightly worse. LCEN had a very high performance on this regression task, reaching a test RMSE only 6% higher than that of the dense MLP. LCEN also had the lowest validation RMSE, which was 54% lower than that of the dense MLP.

Table A8: Results of different model architectures for the “Boston housing” dataset.

Architecture Test RMSE (Thousands USD) OLS 6.38 PLS 6.39 RR = EN 6.39 LASSO 6.38 RF 5.02 GBDT 5.91 AdaB 5.67 SVM 5.15 MLP 4.51 MLP-GL₁ 4.88 LassoNet 4.76 LCEN 4.78

A7 Appendix – Computational resources used

All experiments were done in a personal computer equipped with a 13th Gen Intel^® Core™ i5-13600K CPU, 64 GB of DDR4 RAM, and a NVIDIA GeForce RTX 4090 GPU. Runtimes for the models in the “Diesel Freezing Point” dataset are provided in Table 4, while runtimes for LCEN and ablated architectures are provided in the tables of Section A4.