

A Novel Explanation Against Linear Neural Networks

Anish Lakkapragada 1
1Lynbrook High School, San Jose, CA 95129
Email: [email protected]
Abstract

Linear regression and neural networks are widely used to model data. Neural networks distinguish themselves from linear regression through their use of activation functions, which enable them to model nonlinear functions. The standard argument for these activation functions is that without them, a neural network can only model a line. In this paper, however, we propose a novel explanation for the impracticality of neural networks without activation functions, or linear neural networks (LNNs): they actually reduce both training and testing performance. Having more parameters makes LNNs harder to optimize, so they require more training iterations than linear regression to even potentially converge to the optimal solution. We support this hypothesis through an analysis of the optimization of an LNN and through rigorous testing comparing the performance of LNNs and linear regression on synthetic, noisy datasets.

1 Introduction

Neural networks [1] distinguish themselves from linear regression by their ability to model nonlinear data. This capability comes from their nonlinear activation functions. The standard explanation against neural networks without such activation functions, which we refer to as linear neural networks (LNNs), is that they can only model lines and thus offer no benefit over linear regression.

In this paper, we propose a novel reason for the impracticality of LNNs: they actually perform worse than linear regression, despite modeling the same form of data. The excess of parameters in an LNN corrupts the optimization process, preventing training from reaching the optimal solution. We test our hypothesis through an analysis of the optimization procedure for an LNN and through experiments on synthetic datasets with varying levels of noise.

2 Methods

If we have a univariate dataset $X$ and associated labels $y$, and we assume the relationship between $X$ and $y$ is linear, we can fit a linear regression model $\hat{y}_{i} = ax_{i} + b$, where $\hat{y}_{i}$ is the prediction for the input $x_{i}$. If this model were fully optimized, $a$ and $b$ would be the weight and bias, respectively, that minimize the mean of the squared residuals.

Neural networks for univariate data can be constructed similarly. The output of the first layer, $z_{1}$, is given by $z_{1} = w_{1}x + b_{1}$, where $w_{n}$ and $b_{n}$ denote the weight and bias of the $n$th layer. The output of an LNN with a second layer is then $w_{2}z_{1} + b_{2}$, or equivalently $w_{2}w_{1}x + w_{2}b_{1} + b_{2}$.
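To make this construction concrete, the following is a minimal PyTorch sketch of an LNN built by stacking one-dimensional linear layers with no activation functions between them (the class name and layer count are our own illustrative choices, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class LNN(nn.Module):
    """A linear neural network: stacked 1-to-1 linear layers with no activations."""
    def __init__(self, n_layers: int):
        super().__init__()
        # Each layer contributes a weight w_n and bias b_n, i.e. z_n = w_n * z_{n-1} + b_n.
        self.layers = nn.Sequential(*[nn.Linear(1, 1) for _ in range(n_layers)])

    def forward(self, x):
        return self.layers(x)

# A two-layer LNN computes w2*(w1*x + b1) + b2 = w2*w1*x + w2*b1 + b2 -- still a line in x.
model = LNN(n_layers=2)
x = torch.randn(5, 1)
print(model(x).shape)  # torch.Size([5, 1])
```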

LNNs require iterative optimization, such as gradient descent (GD), to adjust their parameters. GD updates each current parameter based on the derivative of the objective function $J$ with respect to that parameter. Given a learning rate $\alpha$ and any parameter $p_{t}$ at time step $t$, GD updates the parameter to $p_{t+1} = p_{t} - \alpha\frac{dJ}{dp}$. In our case, the objective function is the mean squared error (MSE), $J = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_{i} - y_{i})^{2}$. The derivatives used to optimize the linear regression parameters $\{a, b\}$ in this way are shown in Equation 1.

\frac{dJ}{da} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_{i} - y_{i})x_{i}; \qquad \frac{dJ}{db} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_{i} - y_{i})   (1)
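As a reference point, here is a minimal NumPy sketch of gradient descent on a univariate linear regression using the derivatives in Equation 1 (the learning rate, iteration count, and data below are illustrative placeholders, not the paper's settings):

```python
import numpy as np

def fit_linear_regression_gd(x, y, lr=0.01, n_iters=1000):
    """Gradient descent on J = (1/N) * sum((y_hat - y)^2) for y_hat = a*x + b."""
    a, b = 0.0, 0.0
    N = len(x)
    for _ in range(n_iters):
        y_hat = a * x + b
        # Derivatives from Equation 1.
        dJ_da = (2 / N) * np.sum((y_hat - y) * x)
        dJ_db = (2 / N) * np.sum(y_hat - y)
        a -= lr * dJ_da
        b -= lr * dJ_db
    return a, b

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 1.5 * x - 0.7
print(fit_linear_regression_gd(x, y))  # approaches (1.5, -0.7)
```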

LNN optimization is more cumbersome because of the increased number of parameters. For the two-layer LNN given by $w_{2}w_{1}x + w_{2}b_{1} + b_{2}$, the optimal parameter solution satisfies $w_{2}w_{1} = a$ and $w_{2}b_{1} + b_{2} = b$, so that the LNN's prediction function simplifies to $ax + b$. Because the derivative with respect to any parameter depends on the parameters of the other layers, this solution is harder to reach. Consider the derivative of $J$ with respect to $w_{2}$ used to optimize $w_{2}$:

\frac{dJ}{dw_{2}} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_{i} - y_{i})(w_{1}x_{i} + b_{1})   (2)

we can see that the next GD step for $w_{2}$ is based on the currently suboptimal parameters $w_{1}$ and $b_{1}$. For the optimal solution $w_{2}w_{1} = a$ to be met, the new value of $w_{2}$, computed from a suboptimal $w_{1}$, and $w_{1}$ itself would have to align so that their product equals $a$. Realistically, this only happens if the LNN begins training with a parameter initialization where $w_{2}w_{1} = a$ already holds. Since parameters are initialized randomly, this arrangement is extremely unlikely. The strong interdependency between parameters and their movements across iterations makes it difficult for an LNN's parameters to arrive at the optimal solution. The same dynamics apply to the optimization of the bias parameters. This demonstration also shows how the problem is further exacerbated when the LNN has more layers, and thus more parameters.
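The coupling described above can be seen directly in the gradients of a two-layer LNN. The sketch below, in our own notation and assuming full-batch gradient descent, writes out one GD step; note that the updates for $w_{1}$ and $b_{1}$ carry a factor of the current $w_{2}$, while the update for $w_{2}$ (Equation 2) depends on the current $w_{1}$ and $b_{1}$:

```python
import numpy as np

def lnn2_gd_step(params, x, y, lr=0.001):
    """One gradient-descent step for the two-layer LNN y_hat = w2*(w1*x + b1) + b2."""
    w1, b1, w2, b2 = params
    N = len(x)
    z1 = w1 * x + b1
    y_hat = w2 * z1 + b2
    r = y_hat - y
    # Each gradient is entangled with the *current* values of the other layer's
    # parameters: Equation 2 for w2; the w1 and b1 gradients carry a factor of w2.
    dw2 = (2 / N) * np.sum(r * z1)
    db2 = (2 / N) * np.sum(r)
    dw1 = (2 / N) * np.sum(r * w2 * x)
    db1 = (2 / N) * np.sum(r * w2)
    return (w1 - lr * dw1, b1 - lr * db1, w2 - lr * dw2, b2 - lr * db2)
```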

3 Experiments

We compare the performance of linear regression and LNNs with 2 to 10 layers on synthetic datasets with varying levels of noise.

Data

For simplicity, all data in our experiments are univariate. Note that the same results would hold for multivariate data, since linear regression and LNNs operate on multivariate inputs in essentially the same way across each dimension.

We first sample the input data vector $x$ from a standard normal distribution. We then randomly sample scalars $a$ and $b$ from the same distribution as the true weight and bias of the data, giving the label vector $y = ax + b$. Because no realistic data is perfectly linear, we add noise to the dataset. We sample noise from a standard normal distribution and scale it to the magnitude of the existing data by multiplying by the expectation of $y$. This scaled noise is then multiplied by a noise coefficient $\beta$, which controls the extent to which the labels are corrupted. Finally, the scaled noise is added to the labels $y$ to give the noisy labels, $y_{noise}$. In equation form, the noisy labels are given by:

y_{noise} = ax + b + \beta \cdot \mathcal{N}(0,1) \cdot \mathbb{E}[ax + b]   (3)

For the new noisy dataset, we denote the optimal weight as $a^{*}$ and the optimal bias as $b^{*}$.
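A minimal NumPy sketch of this data-generation procedure is shown below. We approximate $\mathbb{E}[ax+b]$ with the sample mean of the clean labels, and we assume the same true $a$ and $b$ are reused for the training and test splits; both are our own reading of the procedure rather than details confirmed in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_labels(x, a, b, beta, rng):
    """Equation 3: y_noise = a*x + b + beta * N(0,1) * E[a*x + b]."""
    y = a * x + b
    # The sample mean of y stands in for the expectation E[a*x + b].
    return y + beta * rng.standard_normal(len(x)) * y.mean()

# True weight and bias sampled from a standard normal, as in the text.
a, b = rng.standard_normal(2)
x_train, x_test = rng.standard_normal(1000), rng.standard_normal(200)
y_train = make_noisy_labels(x_train, a, b, beta=0.05, rng=rng)
y_test = make_noisy_labels(x_test, a, b, beta=0.05, rng=rng)
```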

Results

We compare the performance of a linear regression model to LNNs with 2 to 10 layers. For each experiment, using the data procedure described above, we generate a 1000-length data and label vector for model training and a 200-length data and label vector for model evaluation. Both datasets are generated with the same noise coefficient. We first train each model on the training data to convergence. At each iteration, we track the model's MSE on the train and test datasets.

Additionally, we track the deviation of each model's parameters from the optimal weight and bias at each iteration. We calculate this deviation by first applying the Normal Equation, a closed-form solution, to the training data to solve for the optimal weight $a^{*}$ and optimal bias $b^{*}$. Because every model computes a linear function, we can simplify each model to the form $mx + b$ and then measure its optimal parameter deviation $D$ as $|m - a^{*}| + |b - b^{*}|$. Over the iterations, this deviation should decrease.
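One way to compute $a^{*}$, $b^{*}$, and $D$ is sketched below; solving the Normal Equation on a design matrix with a bias column, and collapsing a trained model to a line via $m = f(1) - f(0)$, $b = f(0)$, are our own implementation choices, not necessarily the paper's:

```python
import numpy as np
import torch

def optimal_params(x, y):
    """Normal Equation: theta = (X^T X)^{-1} X^T y, with a bias column appended to x."""
    X = np.stack([x, np.ones_like(x)], axis=1)
    a_star, b_star = np.linalg.solve(X.T @ X, X.T @ y)
    return a_star, b_star

def collapse_to_line(model):
    """Any LNN (or linear regression) is affine in x, so m = f(1) - f(0) and b = f(0)."""
    with torch.no_grad():
        f0 = model(torch.zeros(1, 1)).item()
        f1 = model(torch.ones(1, 1)).item()
    return f1 - f0, f0

def deviation(model, a_star, b_star):
    """Optimal parameter deviation D = |m - a*| + |b - b*|."""
    m, b = collapse_to_line(model)
    return abs(m - a_star) + abs(b - b_star)
```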

We perform this experiment 100 times for each of the noise coefficient values 0.05, 0.15, 0.3, and 0.5. We implement our models in PyTorch [2] and train them with SGD [3] using a learning rate of 0.001. We report the mean and standard deviation of the testing MSE (across all 100 experiments) for all models and noise coefficients in Table 1. Figure 1 shows the average optimal parameter deviation $D$ throughout training over the 100 experiments for each model with $\beta = 0.05$. Figure 2 shows the sharp increase in MSE as the LNN parameter count (i.e., number of layers) increases, across all noise levels.
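A minimal PyTorch training loop consistent with this setup might look like the following; the iteration count is a placeholder (the paper trains each model to convergence), and the data tensors are assumed to have shape (N, 1):

```python
import torch
import torch.nn as nn

def train(model, x_train, y_train, x_test, y_test, n_iters=20000, lr=0.001):
    """Full-batch SGD (lr = 0.001), tracking train and test MSE at every iteration."""
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for _ in range(n_iters):
        opt.zero_grad()
        train_mse = loss_fn(model(x_train), y_train)
        train_mse.backward()
        opt.step()
        with torch.no_grad():
            test_mse = loss_fn(model(x_test), y_test)
        history.append((train_mse.item(), test_mse.item()))
    return history

# e.g. nn.Linear(1, 1) for linear regression, or the LNN sketch above for an n-layer LNN:
# history = train(nn.Linear(1, 1), x_train, y_train, x_test, y_test)
```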

Figure 1: Plot of the average optimal parameter deviation $D$ for each model across all 100 training runs.
                        Noise Coefficient $\beta$
Model      0.05               0.15              0.30                 0.50
LinReg     0.0028 ± 0.005     0.0197 ± 0.025    0.086449 ± 0.1197    0.2840 ± 0.4667
LNN-2      0.003 ± 0.006      0.020 ± 0.025     0.086451 ± 0.1197    0.2842 ± 0.4668
LNN-3      0.004 ± 0.007      0.023 ± 0.04      0.09 ± 0.1194        0.2844 ± 0.4665
LNN-4      0.05 ± 0.27        0.03 ± 0.05       0.101 ± 0.13         0.30 ± 0.47
LNN-5      0.08 ± 0.28        0.09 ± 0.26       0.196 ± 0.42         0.36 ± 0.61
LNN-6      0.21 ± 0.55        0.19 ± 0.58       0.26 ± 0.59          0.55 ± 0.9
LNN-7      0.39 ± 0.85        0.40 ± 0.98       0.52 ± 1.02          0.82 ± 1.32
LNN-8      0.69 ± 1.48        0.74 ± 1.14       0.61 ± 0.87          1.01 ± 1.35
LNN-9      0.87 ± 1.27        0.74 ± 1.08       0.72 ± 1.06          1.08 ± 1.45
LNN-10     0.98 ± 1.35        0.90 ± 1.33       0.94 ± 1.17          1.10 ± 1.296
Table 1: Means and standard deviations of testing MSE measured over 100 runs for all models and noise coefficients. LNN-$n$ refers to an LNN with $n$ layers.
Figure 2: Trendlines of testing MSE as the LNN parameter count (number of layers) increases, across all noise levels.

Discussion

The optimal parameter solution $D = 0$ is achieved only by linear regression and LNNs with a few layers. LNNs with more layers typically converge to increasingly suboptimal solutions despite being given an excessive number of iterations. This highlights the empirical difficulty that excess parameters introduce into optimization, hurting both training and testing performance.

4 Conclusion

We have proposed a novel explanation against neural networks without activation functions. We demonstrate the superiority of linear regression over linear neural networks through a comparison of their optimization, and we validate this analysis by testing linear regression and LNNs on 100 datasets at each of several noise levels. We conclude that LNNs perform worse than linear regression in both training and testing due to the more difficult optimization caused by their excess parameters.

References

  • [1] Warren S. McCulloch and Walter Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics 5, Springer, 1943, pp. 115–133.
  • [2] Adam Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems 32, 2019.
  • [3] Herbert Robbins and Sutton Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, 1951, pp. 400–407.