Fractional ridge regression: a fast, interpretable reparameterization of ridge regression

\nameAriel Rokem \email[email protected]
\addrThe University of Washington eScience Institute
University of Washington
Seattle, WA 98195, USA \AND\nameKendrick Kay \email[email protected]
\addrCenter for Magnetic Resonance Research
University of Minnesota
Twin Cities, MN 55455, USA

Abstract

Ridge regression (RR) is a regularization technique that penalizes the L2-norm of the coefficients in linear regression. One of the challenges of using RR is the need to set a hyperparameter ( $\alpha$ ) that controls the amount of regularization. Cross-validation is typically used to select the best $\alpha$ from a set of candidates. However, efficient and appropriate selection of $\alpha$ can be challenging, particularly where large amounts of data are analyzed. Because the selected $\alpha$ depends on the scale of the data and predictors, it is not straightforwardly interpretable. Here, we propose to reparameterize RR in terms of the ratio $\gamma$ between the L2-norms of the regularized and unregularized coefficients. This approach, called fractional RR (FRR), has several benefits: the solutions obtained for different $\gamma$ are guaranteed to vary, guarding against wasted calculations, and automatically span the relevant range of regularization, avoiding the need for arduous manual exploration. We provide an algorithm to solve FRR, as well as open-source software implementations in Python and MATLAB (https://github.com/nrdg/fracridge). We show that the proposed method is fast and scalable for large-scale data problems, and delivers results that are straightforward to interpret and compare across models and datasets.

Keywords: Ridge regression, Hyperparameters, Singular value decomposition, Software

1 Introduction

Consider the standard linear model setting $Y=X\beta$ solved for $\beta$ , where $Y$ is the $(d,~{}t)$ data matrix ( $d$ data points in each of $t$ targets), $X$ is the $(d,~{}p)$ design matrix ( $d$ data points for each of $p$ predictors), and $\beta$ is a $(p,~{}t)$ matrix (with $p$ coefficients, one for each predictor, for each of the targets). Ordinary least-squares regression (OLS) and regression based on the Moore-Penrose pseudoinverse (in cases where $p>d$ ) attempt to find the set of coefficients $\beta$ that minimize squared error for each of the targets $y$ . While these unregularized approaches have some desirable properties, in practical applications where noise is present, they tend to overfit the coefficient parameters to the noise present in the data. Moreover, they tend to cause unstable parameter estimates in situations where predictors are highly correlated.

Ridge regression (Hoerl and Kennard, 1970) addresses these issues by trading off the addition of some bias for the reduction of eventual error (e.g., measured using cross-validation (Stone, 1978, 1974)). It does so by penalizing not only the sum of the squared errors in fitting the data for each target, but by also minimizing the squared L2-norm of the solution, $||\beta||_{2}^{2}=\sum{\beta^{2}}$ . Fortunately, this form of regularization does not incur a substantial computational cost. This is because it can be implemented using the same numerical approach for solving unregularized regression, with the simple addition of a diagonal matrix $\alpha I$ to the standard matrix equations. Thus, the computational cost of solving ridge regression is essentially identical to that of the unregularized solution. Thanks to its simplicity, computational expedience, and its robustness in different data regimes, ridge regression is a very popular technique, with the classical references describing the method (Hoerl and Kennard, 1970; Tikhonov and Arsenin, 1977) cited more than 25,000 times according to Google Scholar.

However, beneath the apparent simplicity of ridge regression is the fact that for most applications, it is impossible to determine a priori the degree of regularization that yields the best solution. This means that in typical practice, researchers must test several different hyperparameter values $\alpha$ and select the one that yields the least cross-validation error on a set of data specifically held out for hyperparameter selection. In large-scale data problems, the number of data points $d$ , number of predictors $p$ , and/or number of targets $t$ can be quite large. This has the consequence that the number of hyperparameter values that are tested, $f$ , can pose a prohibitive computational barrier.

Given the difficulty of predicting the effect of $\alpha$ on solution outcomes, it is common practice to test values that are widely distributed on a log scale (for example, see (Friedman et al., 2010)). Although this approach is not grounded in a particular theory, as long as the values span a large enough range and are spaced densely enough, an approximate minimum of the cross-validation error is likely to be found. But testing many $\alpha$ values can be quite costly, and the practitioner might feel tempted to cull the set of values tested. In addition, it is always a possibility that the initial chosen range might be mismatched to the problem at hand. Sampling $\alpha$ values that are too high or too low will produce non-informative candidate solutions that are either over-regularized ( $\alpha$ too high) or too similar to the unregularized solution ( $\alpha$ too low). Thus, in practice, conventional implementations of ridge regression may produce poor solutions and/or waste substantial computational time.

Here, we propose a simple reparameterization of ridge regression that overcomes the aforementioned challenges. Our approach is to produce coefficient solutions that have an L2-norm that is a pre-specified fraction of the L2-norm of the unregularized solution. In this approach, called fractional ridge regression (FRR), redundancies in candidate solutions are avoided because solutions with different fractional L2-norms are guaranteed to be different. Moreover, by targeting fractional L2-norms that span the full range from 0 to 1, the FRR approach explores the full range of effects of regularization on $\beta$ values from under- to over-regularization, thus assuring that the best possible solution is within the range of solutions explored. We provide a fast and automated algorithm to calculate FRR, and provide open-source software implementations in Python and MATLAB. We demonstrate in benchmarking simulations that FRR is computationally efficient for even extremely large data problems, and we show that FRR applies successfully to real-world data and delivers clear and interpretable results. Overall, FRR may prove particularly useful for researchers tackling large-scale datasets where automation, efficiency, and interpretability are critical.

2 Methods

2.1 Background and theory

Consider the dataset $y$ with dimensionality $d$ (number of data points) by $t$ (number of targets). Each column in $y$ represents a separate target for linear regression:

y=X\beta+\epsilon

(1)

where $y$ is the measured data for a single target (dimensionality $d$ by $1$ ; we drop the index on y for notational convenience), $X$ is the “design” matrix with predictors (dimensionality $d$ by $p$ ), $\beta$ are the coefficients (dimensionality $p$ by 1), and $\epsilon$ is a noise term. Our typical objective is to solve for $\beta$ in a way that minimizes the squared error. If $X$ is full rank, the ordinary least squares (OLS) solution to this problem is:

\hat{\beta}^{OLS}=(X^{t}X)^{-1}X^{t}y

(2)

In cases where $X$ is not full rank, the OLS solution is no longer well-defined and the Moore-Penrose pseudoinverse is used instead. We will refer to these unregularized approaches collectively as OLS.

To regularize the OLS solution, ridge regression applies a penalty to the squared L2-norm of the coefficients, leading to a different estimator for $\beta$ :

\hat{\beta}^{RR}=(X^{t}X+\alpha I)^{-1}X^{t}y

(3)

where $\alpha$ is a hyperparameter and $I$ is the identity matrix (Hoerl and Kennard, 1970; Tikhonov and Arsenin, 1977). For computational efficiency, it is well known that the original problem can be rewritten (Hastie and Tibshirani, 2004) using singular value decomposition (SVD) of the matrix $X$ :

X=USV^{t}

(4)

with $dim(U)=(d,p)$ , $dim(S)=(p,p)$ and $dim(V)=(p,p)$ .

Note that $S$ is a square matrix:

S=\left[{\begin{array}[]{ccccc}\lambda_{1}&0&...&\\ 0&\lambda_{2}&0&...\\ 0&0&\lambda_{3}&0&...\\ \vdots\\ ...&0&0&0&\lambda_{p}\\ \end{array}}\right]

with $\lambda_{i}$ as the singular values ordered from largest to smallest. Replacing the design matrix $X$ with its SVD, we obtain:

y=USV^{t}\beta+\epsilon.

(5)

Given that $U$ is unitary (i.e., $U^{t}U$ is $I$ ), left-multiplying each side with $U^{t}$ produces:

U^{t}y=SV^{t}\beta+U^{t}\epsilon.

(6)

Let $\tilde{y}=U^{t}y$ , $\tilde{\beta}=V^{t}\beta$ , and $\tilde{\epsilon}=U^{t}\epsilon$ . These are transformations (rotations) of the original quantities ( $y$ , $\beta$ , and $\epsilon$ ) through the unitary matrices $U^{t}$ and $V^{t}$ . In cases where $p<d$ , this also projects the quantities into a lower-dimensional space of dimensionality $p$ . The OLS solution can be obtained in this space:

\tilde{\hat{\beta}}^{OLS}=(S^{t}S)^{-1}S^{t}\tilde{y},

(7)

which simplifies to:

\tilde{\hat{\beta}}^{OLS}=S^{-2}(\lambda\odot\tilde{y}),

(8)

where $\odot$ is the Hadamard (element-wise) product, and

S^{-2}=\left[{\begin{array}[]{ccccc}\frac{1}{\lambda_{1}^{2}}&0&...&\\ 0&\frac{1}{\lambda_{2}^{2}}&0&...\\ 0&0&\frac{1}{\lambda_{3}^{2}}&0&...\\ \vdots\\ ...&0&0&0&\frac{1}{\lambda_{p}^{2}}\\ \end{array}}\right]

is the inverse of the square of the singular value matrix $S$ . Thus, in the lower-dimensional space, we can solve the OLS problem with a scalar multiplication:

\tilde{\hat{\beta}}^{OLS}_{i}=\frac{1}{\lambda_{i}^{2}}\lambda_{i}\tilde{y_{i}},

(9)

which simplifies finally to

\tilde{\hat{\beta}}^{OLS}_{i}=\frac{\tilde{y_{i}}}{\lambda_{i}}.

(10)

The SVD-based reformulation of regression described above is additionally useful as it provides insight into the nature of ridge regression (Skouras et al., 1994). Specifically, consider the ridge regression solution in the low-dimensional space:

\tilde{\hat{\beta}}^{RR}=(S^{t}S+\alpha I)^{-1}S^{t}\tilde{y}

(11)

To compute this solution, we note that:

S^{t}S+\alpha I=\left[{\begin{array}[]{ccccc}\lambda_{1}^{2}+\alpha&0&...&\\ 0&\lambda_{2}^{2}+\alpha&0&...\\ 0&0&\lambda_{3}^{2}+\alpha&0&...\\ \vdots\\ ...&0&0&0&\lambda_{p}^{2}+\alpha\\ \end{array}}\right]

(12)

the inverse of which is:

(S^{t}S+\alpha I)^{-1}=\left[{\begin{array}[]{ccccc}\frac{1}{\lambda_{1}^{2}+\alpha}&0&...&\\ 0&\frac{1}{\lambda_{2}^{2}+\alpha}&0&...\\ 0&0&\frac{1}{\lambda_{3}^{2}+\alpha}&0&...\\ \vdots\\ ...&0&0&0&\frac{1}{\lambda_{p}^{2}+\alpha}\\ \end{array}}\right]

(13)

Finally, plugging into equation 11, we obtain:

\tilde{\hat{\beta}}^{RR}_{i}=\frac{\lambda_{i}}{\lambda_{i}^{2}+\alpha}\tilde{y_{i}}

(14)

This shows that in the low-dimensional space, ridge regression can be solved using scalar operations.

To further illustrate the relationship between the ridge regression and OLS solutions, by plugging equation 10 into equation 14, we observe the following:

\tilde{\hat{\beta}}^{RR}_{i}=\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\alpha}\tilde{\hat{\beta_{i}}}^{OLS}

(15)

In other words, the ridge regression coefficients are simply scaled-down versions of the OLS coefficients, with a different amount of shrinkage for each coefficient. Coefficients associated with larger singular values are less shrunken than those with smaller singular values.

To obtain solutions in the original space, it is necessary to left-multiply the coefficients with $V$ :

\hat{\beta}=V\tilde{\hat{\beta}}

(16)

Next, we turn to fractional ridge regression (FRR). The core concept of FRR is to reparameterize ridge regression in terms of the amount of shrinkage applied to the overall L2-norm of the solution. Specifically, we define the fraction $\gamma$ as:

\gamma=\frac{||\tilde{\hat{\beta}}^{RR}||_{2}}{||\tilde{\hat{\beta}}^{OLS}||_{2}}

(17)

Because $V$ is a unitary transformation, the L2-norm of a coefficient solution in the low-dimensional space, $||\hat{\tilde{\beta}}||_{2}$ , is identical to the L2-norm of the coefficient solution in the original space, $||\hat{\beta}||_{2}$ . Thus, we can operate fully within the low-dimensional space and be guaranteed that the fractions will be maintained in the original space.

In FRR, instead of specifying desired values for $\alpha$ , we instead specify values of $\gamma$ between 1 (no regularization) and 0 (full regularization). But how can one compute the ridge regression solution for a specific desired value of $\gamma$ ? Based on equations 9 and 14, it is easy to calculate the value of $\gamma$ corresponding to a specific given $\alpha$ value:

\gamma=\frac{||\tilde{\hat{\beta}}^{RR}||_{2}}{||\tilde{\hat{\beta}}^{OLS}||_{2}}=\sqrt{\frac{\sum{(\frac{\lambda_{i}\tilde{y}_{i}}{\lambda_{i}^{2}+\alpha})^{2}}}{\sum{(\frac{\tilde{y}_{i}}{\lambda_{i}}})^{2}}}

(18)

In some special cases, this calculation can be considerably simplified. For example, if the eigenvalue spectrum of $X$ is flat ( $\lambda_{i}=\lambda_{j}$ for any $i\neq j$ ), we can set all the singular values to $\lambda$ , yielding the following:

\gamma=\sqrt{\frac{p(\frac{\lambda}{\lambda^{2}+\alpha})^{2}{\sum{\tilde{y}^{2}_{i}}}}{p(\frac{1}{\lambda})^{2}\sum{\tilde{y_{i}}^{2}}}}=\frac{\frac{\lambda}{\lambda^{2}+\alpha}{}}{\frac{1}{\lambda}}=\frac{\lambda^{2}}{\lambda^{2}+\alpha},

(19)

This recapitulates the result obtained in (Hoerl and Kennard, 1970), equation 2.6. We can then solve for $\alpha$ :

\alpha=\lambda^{2}(\frac{1}{\gamma}-1)

(20)

Thus, in this case, there is an analytic solution for the appropriate $\alpha$ value, and one can proceed to compute the ridge regression solution using equation 14. Another special case is if we assume that all values of $\tilde{y}_{i}$ are identical. In this case, we can easily calculate the achieved shrinkage, but in terms of L1-norm (not L2-norm):

\frac{||\tilde{\hat{\beta}}^{RR}||_{1}}{||\tilde{\hat{\beta}}^{OLS}||_{1}}=\frac{\sum{\frac{\lambda_{i}\tilde{y}}{\lambda_{i}^{2}+\alpha}}}{\sum{\frac{\tilde{y}}{\lambda_{i}}}}=\frac{\tilde{y}\sum{\frac{\lambda_{i}}{\lambda_{i}^{2}+\alpha}}}{\tilde{y}\sum{\frac{1}{\lambda_{i}}}}=\sum_{i=1}^{p}\frac{\lambda_{i}^{2}}{\lambda_{i}^{2}+\alpha}.

(21)

Notice that this is the sum of the shrinkages for individual coefficients from equation 15, and has been defined as the effective degrees of freedom of ridge regression (See Hastie et al. (2001), pg. 68).

These two special cases have the appealing feature that the regularization level can be controlled on the basis of examining only the design matrix $X$ . However, they rely on strong assumptions that are not guaranteed to hold in general. Thus, for accurate ridge regression outcomes, we see no choice but to develop an algorithm that uses both the design matrix $X$ and the data values $y$ .

2.2 An algorithm for solving fractional ridge regression

Our proposed algorithm is straightforward: it evaluates $\gamma$ for a range of $\alpha$ values and uses interpolation to determine the $\alpha$ value that achieves the desired fraction $\gamma$ . Although this method relies on brute force and may not seem mathematically elegant, it achieves accurate outcomes and, somewhat surprisingly, can be carried out with minimal computational cost.

The algorithm receives as input a design matrix $X$ , target variables $Y$ , and a set of requested fractions $\gamma$ . The algorithm calculates the FRR solutions for all targets in $Y$ , returning estimates of the coefficients $\hat{\beta}$ as well as the values of hyperparameter $\alpha$ that correspond to each requested $\gamma$ . In the text below, we indicate the lines of code that implement each step of the algorithm (see also section 2.3 below) in the MATLAB (designated with “M”) and Python (designated with “P”) implementations.

1.

Compute the SVD of the design matrix, $USV^{t}=X$ (M247, P73). To avoid numerical instability, very small eigenvalues of $X$ are treated as 0.
2.

The data are transformed $\tilde{y}=U^{t}y$ (M254, P75).
3.

The OLS problem is solved with one broadcast division operation (equation 10) (M272, P79).
4.

The values of $\alpha$ that satisfy the FRR requirement are guaranteed to lie within a range that depends on the eigenvalues of $X$ . A series of initial candidate values of $\alpha$ are selected to span a log-spaced range from $10^{-3}\lambda_{p}^{2}$ , much smaller than the smallest singular value of the design matrix, to $10^{3}\lambda_{1}^{2}$ , much larger than the largest singular value of the design matrix (M295-299, P89-96). Based on testing on a variety of regression problems, we settled on a spacing of $0.2~{}log_{10}$ units within the range of candidate $\alpha$ values. This provides fine enough gridding such that interpolation results are nearly perfect (empirical fractions are approximately $1\%$ or less from the desired fractions).
5.

Based on equation 15, a scaling factor for every value of $\alpha$ and every eigenvalue $\lambda$ is calculated as (M310-312, P97-98):

$SF_{i,j}=\lambda_{i}^{2}/(\lambda_{i}^{2}+\alpha_{j})$ (22)
6.

The main loop of the algorithm iterates over targets. For every target, the scaling in equation 22 is applied to the computed OLS coefficients (from Step 3), and the L2-norm of the solution at each $\alpha_{j}$ is divided by the L2-norm of the OLS solution to determine the fractional length, $\gamma_{j}$ (M332-345, P111-112).
7.

Interpolation is used with $\alpha_{j}$ and $\gamma_{j}$ to find values of $\alpha$ that correspond to the desired values of $\gamma$ (M361, P113). These target $\alpha$ values are then used to calculate the ridge regression solutions via equation 15 (M367, P116-117).
8.

After the iteration over targets is complete, the solutions are transformed to the original space by multiplying $\hat{\beta}=V\tilde{\hat{\beta}}$ (M418, PL119).

In summary, this algorithm requires just one (potentially computationally expensive) initial SVD of the design matrix. Operations done on a per-target basis are generally inexpensive, relying on fast vectorized array operations, with the exception of the interpolation step. Although a large range of candidate $\alpha$ values are evaluated internally by the algorithm, these values are eventually discarded, thereby avoiding costs associated with the final step (multiplication with $V$ ).

2.3 Software implementation

We implemented the algorithm described in section 2.2 in two different popular statistical computing languages: MATLAB and Python (example code in Figure 1). The code for both implementations is available at https://github.com/nrdg/fracridge and released under an OSI-approved, permissive open-source license to facilitate its broad use. In both MATLAB and Python, we used broadcasting to rapidly perform computations over multiple dimensions of arrays.

There are two potential performance bottlenecks in the code. One is the SVD step which is expensive both in terms of memory and computation time. This step is optimized by observing that for cases in which $d>p$ (the number of data points is larger than the number of parameters), we can replace the singular values of $X$ by the square roots of the singular values of $X^{t}X$ , which is $p$ by $p$ and therefore requires less memory for the SVD computation. The other potential performance bottleneck is the interpolation performed for each target. To optimize this step, we used fast interpolation functions that assume sorted inputs.

2.3.1 Matlab

The MATLAB implementation of FRR relies only on core MATLAB functions and a fast implementation of linear interpolation (Mier, 2020), which is copied into the fracridge source code, together with its license, which is compatible with the fracridge license. The MATLAB implementation includes an option to automatically standardize predictors (either center or also scale the predictors) before regularization, if desired.

2.3.2 Python

The Python implementation of FRR depends on Scipy (Virtanen et al., 2020) and Numpy (van der Walt et al., 2011). The object-oriented interface provided conforms with the API of the popular Scikit-Learn library (Pedregosa et al., 2011; Buitinck et al., 2013), including automated tests that verify compliance with this API. Unit tests are implemented using pytest (Krekel et al., 2004–). Documentation is automatically compiled using sphinx, with sphinx-gallery examples (Òscar Nájera et al., 2020). The Python implementation also optionally uses Numba (Lam et al., 2015) for just-in-time compilation of a few of the underlying numerical routines used in the implementation. This functionality relies on an implementation provided in the hyperlearn library (Han-Chen, 2020) and copied into the fracridge source-code, together with its license, which is compatible with the fracridge license. In addition to its release on GitHub, the software is available to install through the Python Package Index (PyPI) through the standard Python Package Installer (pip install fracridge). For Python, we did not implement standardization procedures, as those are implemented as a part of Scikit-Learn.

Matlab

⬇

y = randn(100,1);

X = randn(100,10);

% Calculate coefficients with naive OLS

coef = inv(X’*X)*X’*y;

% Call the fracridge function:

[coef2, alpha] = fracridge(X, 0.3, y);

% Calculate coefficients with naive RR

alphaI = alpha*eye(size(X, 2));

coef3 = inv(X’*X + alphaI)*X’*y;

norm(coef)

norm(coef2)

norm(coef2) ./ norm(coef)

norm(coef2-coef3)

Python

⬇

import numpy as np

from numpy.linalg import inv, norm

from fracridge import fracridge

from fracridge import FracRidge

y = np.random.randn(100)

X = np.random.randn(100, 10)

# Calculate coefficients with naive OLS

coef = inv(X.T @ X) @ X.T @ y

# Call fracridge function:

coef2, alpha = fracridge(X, y, 0.3)

# Calculate coefficients with naive RR

alphaI = alpha * np.eye(X.shape[1])

coef3 = inv(X.T @ X + alphaI) @ X.T @ y

print(norm(coef))

print(norm(coef2))

print(norm(coef2) / norm(coef))

print(norm(coef2 - coef3))

# sklearn-compatible object-oriented API:

fr = FracRidge(fracs=0.3)

fr.fit(X, y)

coef_oo = fr.coef_

alpha_oo = fr.alpha_

print(norm(coef_oo) / norm(coef))

Figure 1: Code examples. Left: MATLAB examples that demonstrate the software API and correctness of the implementation. Right: Python examples demonstrate a similar API and correctness. Python examples include the Sckit-Learn-compatible API

2.4 Simulations

Numerical simulations were used to characterize FRR and compare it to a heuristic approach for hyperparameter selection. Simulations were conducted using the MATLAB implementation. We simulated two simple regression scenarios. The number of data points ( $d$ ) was 100, and the number of predictors ( $p$ ) was either 5 or 100. In each simulation, we first created a design matrix X ( $d,p$ ) using the following procedure: (i) generate normally distributed values for X, (ii) induce correlation between predictors by selecting two predictors at random, setting one of the predictors to the sum of the two predictors plus normally distributed noise, and repeating this procedure $2p$ times, and (iii) z-scoring each predictor. Next, we created a set of “ground truth” coefficients $\beta$ with dimensions ( $p,1$ ) by drawing values from the normal distribution. Finally, we simulated responses from the model ( $y=X\beta$ ) and added normally distributed noise, producing a target variable $y$ with dimensions ( $d,1$ ).

Given design matrix $X$ and target $y$ , cross-validated regression was carried out. This was done by splitting $X$ and $y$ into two halves (50/50 training/testing split), solving ridge regression on one half (training) and evaluating generalization performance of the estimated regression $\beta$ weights on the other half (testing). Performance was quantified using the coefficient of determination ( $R^{2}$ ). For standard ridge regression, we evaluated a grid of $\alpha$ values that included 0 and ranged from $10^{-4}$ through $10^{5.5}$ in increments of $0.5~{}log_{10}$ units. For FRR, we evaluated a range of fractions $\gamma$ from 0 to 1 in increments of 0.05. Thus, the number of hyperparameter values was $f=21$ in both cases.

The code that implements these simulations is available in the “examples” folder of the software.

2.5 Performance benchmark

To characterize the performance of fractional ridge regression (FRR) and standard ridge regression (SRR) approaches, a set of numerical benchmarks was conducted using the MATLAB implementation. A range of regression scenarios were constructed. In each experiment, we first constructed a design matrix X ( $d,p$ ) consisting of values drawn from a normal distribution. We then created “ground truth” coefficients $\beta$ ( $p,t$ ) also by drawing values from the normal distribution. Finally, we generated a set of data $Y$ ( $d,t$ ) by predicting the model response ( $y=X\beta$ ) and adding zero-mean Gaussian noise with standard deviation equal to the standard deviation of the data from each target variable. Different levels of regularization ( $f$ ) were obtained for SRR by linearly spacing $\alpha$ values on a $log_{10}$ scale from $10^{-4}$ to $10^{5}$ and for FRR by linearly spacing fractions from 0.05 to 1 in increments of 0.05.

Two versions of SRR were implemented and evaluated. The first version (naïve) involves a separate matrix (pseudo-)inversion for each hyperparameter setting desired. The second version (rotation-based) involves using the SVD decomposition method described above (see section 2.1, specifically equation 14).

All simulations were run on an Intel Xeon E5-2683 2.10 Ghz (32-core) workstation with 128 GB of RAM, a 64-bit Linux operating system, and MATLAB 8.3 (R2014a). Execution time was logged for model fitting procedures only and did not include generation of the design matrix or the data. Likewise, memory requirements were recorded in terms of additional memory usage during the course of model fitting (i.e. zero memory usage corresponds to the total memory usage just prior to the start of model fitting). Benchmarking results were averaged across 15 independent simulations to reduce incidental variability.

The code that implements these benchmarks is available in the “examples” folder of the software.

2.6 Brain Magnetic Resonance Imaging data

Brain functional Magnetic Resonance Imagine (fMRI) data were collected in a 7 Tesla MRI instrument, at a spatial resolution of $1.8~{}mm$ and a temporal resolution of $1.6~{}s$ and using a matrix size of [81 104 83]. This yielded a total of 783,432 voxels. Over the course of 40 separate scan sessions, a participant viewed 10,000 distinct images (3 presentations per image) while fixating a small dot placed at the center of the images (see Figure 3A). The images were $8.4\deg$ by $8.4\deg$ in size. Each image was presented for $3~{}s$ and was followed by a $1~{}s$ gap. Standard pre-processing steps were applied to the fMRI data to remove artifacts due to head motion and other confounding factors. To deal with session-wise nonstationarities, response amplitudes of each voxel were z-scored within each scan session. Responses were then concatenated across sessions and averaged across trials of the same image, and then a final z-scoring of each voxel’s responses was performed.

A regression model was used to predict the response observed from a voxel in terms of local contrast present in the stimulus image. In the model, the stimulus image is pre-processed by taking the original color image (425 pixels by 425 pixels by 3 RGB channels), converting the image to grayscale, gridding the image into 25 by 25 regions, and then computing the standard deviation of luminance values within each grid region (Figure 4B). This produced 625 predictors, each of which was then z-scored. The design matrix $X$ has dimensionality 10,000 images by 625 stimulus regions, while $Y$ has dimensionality 10,000 images by 783,432 voxels.

Cross-validation was carried out using a 80/20 training/testing split. For standard ridge regression, we evaluated a grid of alpha values that included 0 and ranged from $10^{-4}$ to $10^{5.5}$ in increments of $0.5~{}log_{10}$ units. For fractional ridge regression, we evaluated a range of fractions from 0 to 1 in increments of 0.05. Cross-validation performance was quantified in terms of variance explained on the test set using the coefficient of determination ( $R^{2}$ ).

The code that implements these benchmarks is available in the “examples” folder of the software.

3 Results

3.1 Fractional ridge regression achieves the desired outcomes

In simulations, we demonstrate that the fractional ridge regression (FRR) algorithm accurately produces the desired fractions $\gamma$ (Figure 2 A,B second row, right column in each). We compare the results of FRR to results of standard ridge regression (SRR), in which a commonly-used heuristic is used to select $\alpha$ values (log-spaced values spanning a large range). For the SRR approach, we find that the fractional L2-norm is very small and virtually indistinguishable for large values of $\alpha$ , and is very similar to the OLS solution (fractional L2-norm approximately 1) for several small values of $\alpha$ (Figure 2 A, B second row, left column). In addition, cross-validation accuracy is indistinguishable for many of the values of $\alpha$ evaluated in SRR. Only very few values of $\alpha$ produce cross-validated $R^{2}$ values that are similar to the value provided by the best $\alpha$ (Figure 2 A, B first row, left column).

The SRR results can also be re-represented using effective degrees of freedom (DOF; Figure 2 A, B first row, middle column): several values of $\alpha$ result in essentially the same number of DOF, because these values are either much larger than the largest singular value or much smaller than the smallest singular value of $X$ . In contrast to SRR, FRR produces a nicely behaved range of cross-validated $R^{2}$ values and dense sampling around the peak $R^{2}$ .

Another line of evidence highlighting the diversity of the solutions provided by FRR is given by inspecting coefficient paths: in the log-spaced case, coefficients start very close to 0 (for high $\alpha$ ) and rapidly increase (for lower $\alpha$ ). Even when re-represented using DOF, the coefficient paths exhibit some redundancy. In contrast, FRR provides more gradual change in the coefficient paths, indicating that this approach explores the space of possible coefficient configurations more uniformly. Taken together, these analyses demonstrate that FRR provides a more useful range of regularization levels than SRR.

Refer to caption — Figure 2: Fractional ridge regression (FRR) achieves desired outcomes. (A) Example regression scenario ( $d=100$ , $p=5$ ). The first two columns show the results of standard ridge regression in which log-spaced $\alpha$ values are used to obtain different levels of regularization. Whereas the first column shows results as a function of $log_{10}(\alpha)$ , the second column shows results as a function of $\alpha$ values converted to effective degrees of freedom (see Methods). The third column shows the results of fractional ridge regression in which different regularization levels are achieved by requesting specific fractional L2-norm ( $\gamma$ ). Solid blue dots mark peak cross-validation performance. Vertical gray lines in the third column indicate regression solutions obtained by the FRR method (requested fractions range from 0 to 1 in increments of 0.05). The corresponding locations of these regression solutions in the first and second columns are also shown using vertical gray lines. The bottom row shows coefficient paths, i.e., the values of $\beta$ as a function of $log_{10}(\alpha)$ , degrees of freedom, or fraction $\gamma$ . (B) Example regression scenario ( $d=100$ , $p=100$ ). Same format as panel A. Notice that in both scenarios, only the FRR method achieves regression solutions whose L2-norms increase linearly, with gradually changing coefficient paths.

3.2 FRR is computationally efficient

A question of relevance to potential users of FRR is whether using the method incurs significant computational cost. We compare FRR to two alternative approaches. The first approach is a naïve implementation of the matrix inversion specified in equation 3, in which the Moore-Penrose pseudo-inverse (implemented as pinv in Matlab and numpy.linalg.pinv in Python) is performed independently for each setting of hyperparameter $\alpha$ . The second approach takes advantage of the computational expedience of the SVD-based approach: instead of a matrix inversion for each $\alpha$ value, a single SVD is performed, a transformation (rotation) is applied to the data, and different values of $\alpha$ are plugged into equation 14 to compute the regression coefficients. This approach comprises a subset of the operations taken in FRR. Therefore, it represents a lower bound in terms of computational requirements.

Through systematic exploration of different problem sizes, we find that FRR performs quite favorably. FRR differs from the rotation-based approach only slightly with respect to execution-time scaling in the number of data points (Figure 3A, left column), in the number of parameters (Figure 3A, right column), and in $f$ , the number of hyperparameter values considered (Figure 3A, third column ). The naïve matrix-inversion approach is faster than both SVD-based approaches (FRR and rotation-based) for $f<20$ , but rapidly becomes much more costly for values above 20. This approach also scales rather poorly for $p>5,000$ .

In terms of memory consumption, the mean and maximum memory usage are very similar for FRR, the naïve and rotation-based solutions. These results suggest that for each of these approaches, the matrix inversion (for the naïve implementation of SRR) or the SVD (for FRR and the SVD-based SRR) represents the main computational bottleneck. Importantly, despite the fact that FRR uses additional gridding and interpolation steps, it does not perform substantially worse than either of the other approaches.

3.3 Application of FRR on real-world data

To demonstrate the practical utility of FRR, we explore its application in a specific scientific use-case. Data from a functional Magnetic Resonance Imaging (fMRI) experiment were analyzed with FRR and the results of this analysis were compared to a standard ridge regression (SRR) approach where $\alpha$ values are selected using a log-spaced heuristic. Different parts of the brain process different types of information, and a large swath of the cerebral cortex is known to respond to visual stimulation. Experiments that combine fMRI with computational analysis provide detailed information about the responses of different parts of the brain (Wandell et al., 2015). In the experiments analyzed here, a series of images are shown and the MRI blood-oxygen-level-dependent (BOLD) signal is recorded in a sampling grid of voxels throughout the brain (Figure 4A). In the cerebral cortex, each voxel contains hundreds of thousands of neurons. If these neurons respond vigorously to the visual stimulus presented, the metabolic demand for oxygen in that part of cortex will drive a transient increase in oxygenated blood in that region, and the BOLD response will increase. Thus, a model of the BOLD response tells us about the selective responses of neurons in each voxel in cortex.

Because neurons in parts of the cerebral cortex that respond to visual stimuli are known to be particularly sensitive to local contrast, we model responses with respect to the standard deviation of luminance in each region of the image, rather than the luminance values themselves (Figure 4B). In the model, $Y$ contains brain responses where each target (column) represents the responses in a single voxel. Each row contains the response of all voxels to a particular image. The design matrix $X$ contains the local contrast in every region of the image, for every image. This means that the coefficients $\beta$ represent weights on the stimulus image and indicate each voxel’s spatial selectivity – i.e., the part of the image to which the voxel responds (Wandell and Winawer, 2015). Therefore, one way to visualize $\hat{\beta}$ is to organize it according to the two-dimensional layout of the image (Figure 4C&D, bottom two rows).

Using FRR, we fit the model to voxel responses, and find robust model performance in the posterior part of the brain where visual cortex resides (left part of the horizontal slice presented in the top row of Figure 4C). The performance of the model can be observed in either the cross-validated $R^{2}$ values (Figure 4C, top row, left and middle panels) or the value of $\gamma$ corresponding to the best cross-validated $R^{2}$ . For example, we can focus on the two voxels highlighted in the middle panel of the top row in Figure 4C. One voxel, whose characteristics are further broken down in Figure 4D has lower cross-validated $R^{2}=4\%$ and requires stronger regularization ( $\gamma=0.15$ ). The spatial selectivity of this voxel’s responses becomes very noisy at large $\gamma$ values and $R^{2}$ approaches 0. On the other hand, the voxel in Figure 4E has a higher $\gamma=0.35$ and a higher cross-validated $R^{2}=13\%$ . Moreover, this voxel appears more robust with higher values of $\gamma$ producing less spatially noisy results. The map of $R^{2}$ and $\gamma$ illustrated in Figure 4C show that these trends hold more generally: voxels with more accurate models require less regularization.

4 Discussion

The main theoretical contribution of this work is a novel approach to hyperparameter specification in ridge regression. Instead of the standard approach in which a heuristic range of values for hyperparameter $\alpha$ are evaluated for their accuracy, the fractional ridge regression (FRR) approach focuses on achieving specific fractions for the L2-norms of the solutions relative to the L2-norm of the unregularized solution. In a sense, this is exactly in line with the original spirit of ridge regression, which places a penalty on the L2-norm of the solution. The main practical contribution of this work is the design and implementation of an efficient algorithm to solve FRR and validation of this algorithm on simulated and empirical data. Overall, we suggest that FRR can serve as a default approach to solving ridge regression.

4.1 The benefits of FRR

1.

Theoretically-motivated and principled. The results in section 3.1 demonstrate that the theoretical motivation described in the Methods holds in practice. Our implementation of FRR produces ridge regression solutions that have predictable and tuneable fractional L2-norm.
2.

Statistically efficient. Each fraction level returned by FRR produces $\beta$ values that are distinctly different. This avoids the common pitfall in the log-spaced approach whereby computation is wasted on several values of $\alpha$ that all over-regularize or under-regularize. When used with a range of $\gamma$ values from 0 to 1, the solution that minimizes cross-validation error is guaranteed to exist within this range (although it may lie in between two of the obtained solutions).
3.

Computationally efficient. We show that our implementation of FRR requires memory and computational time that are comparable to a naïve ridge regression approach and to an approach that uses SVD but relies on preset $\alpha$ values. SVD-based approaches (including FRR) scale linearly in $f$ , with compute-time scaling better than naïve RR in the $f>20$ regime. In practice, we have found that $f=20$ evenly distributed values between 0 and 1 provide sufficient coverage for many problems. But the linear scaling implies that sampling more finely would not be limiting in cases where additional precision is needed.
4.

Interpretable. FRR uses $\gamma$ values that represent scaling relative to the L2-norm of the OLS solution. This allows FRR results to be compared across different targets within a dataset. This is exemplified in section 3.3, in which results from an fMRI experiment are interpreted both in light of cross-validated $R^{2}$ and the optimal $\gamma$ that leads to the best cross-validated $R^{2}$ . Moreover, regularization in different datasets and for different models (e.g., different settings of $X$ ) can be compared to each other as being stronger or weaker. The optimal regularization level can be informative regarding the signal-to-noise of a particular target or about the level of collinearity of the design matrix (which both influence the optimal level of regularization). FRR increases the interpretability of ridge regression, because instead of an unscaled, relatively inscrutable value of $\alpha$ , we receive a scaled, relatively interpretable value. Based on a recently proposed framework for interpretability in machine learning methods (Murdoch et al., 2019), we believe that this kind of advance improves the descriptive accuracy of ridge regression.
5.

Automatic. Machine learning algorithms focus on automated inferences, but many machine learning algorithms still require substantial manual tuning. For example, if the range of $\alpha$ values used is not sufficient, users of ridge regression may be forced to explore other values. This is impractical in cases in which thousands of targets are analyzed and multiple models evaluated. Thus, FRR contributes to the growing field of methods that aim to automate machine learning methods (Zöller and Huber, 2019; Tuggener et al., 2019). These methods all aim to remove the burden of manual inspection and tuning of machine learning. A major benefit of FRR is therefore practical in nature: Because FRR spans the dynamic range of effects that ridge regression can provide, using FRR guarantees that the time taken to explore hyperparameter values is well spent. Moreover, the user does not have to spend time speculating what $\alpha$ values might be appropriate for a given problem (e.g. is $10^{4}$ sufficiently high?).
6.

Implemented in usable open-source software. We provide code that is well-documented, thoroughly tested, and easy to use: https://github.com/nrdg/fracridge. The software is available in two popular statistical programming languages: MATLAB and Python. The Python implementation provides an object-oriented interface that complies with the popular Scikit-Learn library (Pedregosa et al., 2011; Buitinck et al., 2013).

4.2 Limitations

One limitation of FRR is that a heuristic approach is used within the algorithm to generate the grid of $\alpha$ values used for interpolation (see section 2.2 for details). Nonetheless, the interpolation results are quite accurate, and costly computations are carried out only for final desired $\alpha$ values. Another limitation is that the $\alpha$ value that corresponds to a specific $\gamma$ may be different for different targets and models. If there are theoretical reasons to retain the same $\alpha$ across targets and models, the FRR approach is not appropriate. But this would rarely be the case, as $\alpha$ values are usually not directly interpretable. Alternatively, FRR can be used to estimate values of $\alpha$ on one sample of the data (or for one model) and these values of $\alpha$ can then be used in all of the data (or all models).

Finally, the FRR approach is limited to ridge regression and does not generalize easily to other regularization approaches. The Lasso (Tibshirani, 1996) provides regression solutions that balance least-squares minimization with the L1-norm of the coefficients, rather than the L2-norm of the coefficients. The Lasso approach has several benefits, including results that are more sparse and potentially easier to interpret. Similarly, Elastic Net (Zou and Hastie, 2005) uses both L1- and L2-regularization, potentially offering more accurate solutions. But because the computational implementation of these approaches differs quite substantially from ridge regression, the approach presented in this paper does not translate easily to these methods. Moreover, while these methods allow regularization with a non-negativity constraint on the coefficients, this constraint is not easily incorporated into L2-regularization. On the other hand, a major challenge that arises in L1-regularization is computational time: most algorithms operate for one target at a time and incur substantial computational costs, and scaling such algorithms to the thousands of targets in large-scale datasets may be difficult.

4.3 Future extensions

An important extension of the present work would be an implementation of these ideas in additional statistical programming languages, such as the R programming language, which is very popular for use in statistical analysis of data from many different domains. One of the most important tools for regularized regression is the glmnet software package which was originally implemented in the R programming language (Friedman et al., 2009) and has implementations in MATLAB (Qian et al., 2013) and Python (Balakumar, 2016). The software also provides tools for analysis and visualization of coefficient paths and of the effects of regularization on cross-validated error. The R glmnet vignette (Hastie and Qian, 2014) demonstrates the use of these tools. In addition to identifying the $\alpha$ value that minimizes cross-validation error, glmnet also identifies the $\alpha$ which gives the most regularized model such that the cross-validated error is within one standard error of the minimum cross-validated error. This approach acknowledges that there is some error in selecting $\alpha$ and chooses to err on the side of a more parsimonious model (Friedman et al., 2010). Future extensions of FRR could implement this heuristic.

Acknowledgments

The authors would like to thank Noah Simon for helpful discussions. AR was funded through a grant from the Gordon & Betty Moore Foundation and the Alfred P. Sloan Foundation to the University of Washington eScience Institute, through NIH grants 1RF1MH121868-01 (PI: AR) from the National Institute for Mental Health and 5R01EB027585-02 (PI: Eleftherios Garyfallidis, Indiana University) from the National Institute for Biomedical Imaging and Bioengineering and through NSF grants 1934292 (PI: Magda Balazinska, University of Washington). KK was supported by NIH P41 EB015894. Collection of MRI data was supported by NIH S10 RR026783 and the W.M. Keck Foundation.

References

Balakumar (2016) Hastie T. Friedman J. Tibshirani R. Simon N. Balakumar, B.J. Glmnet for python, 2016, 2016. URL https://web.stanford.edu/~hastie/glmnet_python/.
Buitinck et al. (2013) Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, et al. Api design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238, 2013.
Friedman et al. (2009) Jerome Friedman, Trevor Hastie, and Rob Tibshirani. glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1(4), 2009.
Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.
Han-Chen (2020) Daniel Han-Chen. hyperlearn, 2020. URL https://github.com/danielhanchen/hyperlearn/.
Hastie and Qian (2014) Trevor Hastie and Junyang Qian. Glmnet vignette, 2014. URL http://www.web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf.
Hastie and Tibshirani (2004) Trevor Hastie and Robert Tibshirani. Efficient quadratic regularization for expression arrays. Biostatistics, 5(3):329–340, July 2004.
Hastie et al. (2001) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
Hoerl and Kennard (1970) Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Krekel et al. (2004–) Holger Krekel, Bruno Oliveira, Ronny Pfannschmidt, Floris Bruynooghe, Brianna Laugher, and Florian Bruhin. pytest 5.4.1, 2004–. URL https://github.com/pytest-dev/pytest.
Lam et al. (2015) Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a LLVM-based python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, number Article 7 in LLVM ’15, pages 1–6, New York, NY, USA, November 2015. Association for Computing Machinery.
Mier (2020) Jose M Mier. Quicker 1d linear interpolation: interp1qr, 2020. URL https://www.mathworks.com/matlabcentral/fileexchange/43325-quicker-1d-linear-interpolation-interp1qr.
Murdoch et al. (2019) W James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U. S. A., 116(44):22071–22080, October 2019.
Òscar Nájera et al. (2020) Òscar Nájera, Eric Larson, Loïc Estève, Gael Varoquaux, Jaques Grobler, Lucy Liu, Elliott Sales de Andrade, Chris Holdgraf, Alexandre Gramfort, Mainak Jas, Joel Nothman, Olivier Grisel, Nelle Varoquaux, Emmanuelle Gouillart, Martin Luessi, Antony Lee, Jake Vanderplas, Tim Hoffmann, Thomas A Caswell, Bane Sullivan, Alyssa Batula, Adrian Price-Whelan, jaeilepp, Thomas Robitaille, Stefan Appelhoff, Patrick Kunzmann, Matthias Geier, Lars, Kyle Sunden, and Dominik Stańczak. sphinx-gallery/sphinx-gallery: Release v0.6.1, April 2020. URL https://doi.org/10.5281/zenodo.3741781.
Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
Qian et al. (2013) Junyang Qian, T Hastie, J Friedman, R Tibshirani, and N Simon. Glmnet for matlab, 2013, 2013. URL http://www.stanford.edu/hastie/glmnetmatlab.
Skouras et al. (1994) K Skouras, C Goutis, and MJ Bramson. Estimation in linear models using gradient descent with early stopping. Statistics and Computing, 4(4):271–278, 1994.
Stone (1978) M Stone. Cross-validation: A review. Statistics: A Journal of Theoretical and Applied Statistics, 9(1):127–139, 1978.
Stone (1974) Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
Tikhonov and Arsenin (1977) Andrey N Tikhonov and Vasiliy Y Arsenin. Solutions of ill-posed problems. Wiley, 1977.
Tuggener et al. (2019) Lukas Tuggener, Mohammadreza Amirian, Katharina Rombach, Stefan Lörwald, Anastasia Varlet, Christian Westermann, and Thilo Stadelmann. Automated machine learning in practice: State of the art and recent results. July 2019.
van der Walt et al. (2011) S van der Walt, S C Colbert, and G Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science Engineering, 13(2):22–30, March 2011.
Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J van der Walt, Matthew Brett, Joshua Wilson, K Jarrod Millman, Nikolay Mayorov, Andrew R J Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E A Quintero, Charles R Harris, Anne M Archibald, Antônio H Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in python. Nat. Methods, 17(3):261–272, March 2020.
Wandell et al. (2015) BA Wandell, J Winawer, and KN Kay. Computational modeling of responses in human visual cortex. In Acquisition Methods, Methods and Modeling, pages 651–659. Elsevier Inc., 2015.
Wandell and Winawer (2015) Brian A Wandell and Jonathan Winawer. Computational neuroimaging and population receptive fields. Trends Cogn. Sci., 19(6):349–357, June 2015.
Zöller and Huber (2019) Marc-André Zöller and Marco F Huber. Benchmark and survey of automated machine learning frameworks. April 2019.
Zou and Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.