Consensus Function from an $L_{p}^{q}-$ norm Regularization Term for its Use as Adaptive Activation Functions in Neural Networks

Juan Heredia-Juesas and José Á. Martínez-Lorenzo J. Heredia-Juesas and J. Á. Martínez-Lorenzo are with the Departments of Electrical & Computer Engineering, and Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115 USA. email: [email protected].

Abstract

The design of a neural network is usually carried out by defining the number of layers, the number of neurons per layer, their connections or synapses, and the activation function that they will execute. The training process tries to optimize the weights assigned to those connections, together with the biases of the neurons, to better fit the training data. However, the definition of the activation functions is, in general, determined in the design process and not modified during the training, meaning that their behavior is unrelated to the training data set. In this paper we propose the definition and utilization of an implicit, parametric, non-linear activation function that adapts its shape during the training process. This fact increases the space of parameters to optimize within the network, but it allows a greater flexibility and generalizes the concept of neural networks. Furthermore, it simplifies the architectural design since the same activation function definition can be employed in each neuron, letting the training process to optimize their parameters and, thus, their behavior. Our proposed activation function comes from the definition of the consensus variable from the optimization of a linear underdetermined problem with an $L_{p}^{q}$ regularization term, via the Alternating Direction Method of Multipliers (ADMM). We define the neural networks using this type of activation functions as $pq-$ networks. Preliminary results show that the use of these neural networks with this type of adaptive activation functions reduces the error in regression and classification examples, compared to equivalent regular feedforward neural networks with fixed activation functions.

Index Terms:

Activation Functions, ADMM, Functional Analysis,

L_{p}^{q}-

norm regularization, Neural Networks

I Introduction

Neural Networks are widely used for different applications, such as identification, security, defense, face and speech recognition, healthcare, or weather and stock market prediction, among others, [1, 2, 3, 4, 5, 6, 7, 8]. Although the architectural design of the neural networks may amply vary depending on the particular application, they all share the basic intrinsic foundation of mimicking the behavior of a biological brain by connecting several computational nodes, the so-called neurons, that perform simple non-linear operations, with the aim of developing complex practices. Mathematically, neural networks are a powerful tool that can basically describe any relationship between any given input-output set of data, potentially modeling any imaginable non-linear mapping. Despite its questionless capabilities, neural networks rely on the simplicity of connecting several layers of neurons that enchains two straightforward operations: a linear combination of the results provided from the neurons of a previous layer, and a non-linear function, which outcome is passed to the next layer. The number of layers, the number of neurons per layer, and the type of non-linear functions implemented to the neurons of each layer, the so-called activation or transfer function, is part of the architectural design of the neural network. The larger the network, the more complex capabilities and highly non-linearities could be modeled, [9].

The fitting of the data to the designed network is done by adapting the weights and biases that define the linear combination on the first operation done in each neuron during the training process. However, the selection of the activation functions generally is reduced to a limited set of non-linear functions, such as the sigmoid, rectified linear units, max-pooling, etc. This selection has, a priori, little or no influence from the dataset that is used for training the network. The exploration of adaptive activation functions to increase the flexibility of the neural networks during the training process is an active topic of research. [10] points out the importance that the activation functions have on the neural network performance and review some initial attempts to find an optimal distribution of activation functions from a reduced set of options, by using evolutionary algorithms. [11] defines the maxout activation function, which extracts the maximum of a set of linear functions, arbitrarily approximating any convex function. [12] proposes adaptive piecewise linear units that are learned using gradient descend during the training process, enabling the approximation of both convex and non-convex functions; and [13] introduces an $L_{p}$ unit, which performs the $p-$ norm of the normalized input data, adapting the values of $p$ during the training process and generalizing the pooling operator, for which the maxpooling and maxout units are particular cases of it. These methods progress towards neural network architectures whose training process learns, not only the relationships among the neurons, but also the neurons themselves. However, they do not totally generalize the activation functions and require software applications to approximate the gradients during the training process, thus, a better understanding of the underlying functional analysis of generalized activation functions is still required.

In this paper, we present a parametric activation function whose parameters are learned during the training process and that can adapt to mimic most of the common activation functions regularly used. On one hand, this imply to increase the space of training parameters, but on the other hand, this generalizes the global structure of the neurons, since their activation functions do not need to be predefined. The definition of the proposed activation function comes from the functional analysis of the consensus variable when performing a linear optimization problem with a general $L_{p}^{q}-$ norm regularization term, via the Alternating Direction Method of Multipliers (ADMM), as it is described in Sect. II. The mathematical development for the use of the proposed activation function, and the analysis for the computation of the gradients for the optimization of the implied parameters is detailed in Sect. III. Section IV presents some preliminary results, showing that the use of these adaptive activation functions reduces both the training and testing errors when they are applied on shallow neural networks, compared to equivalent regular feedforward networks in which the activation functions are predetermined. Finally, Sect. V concludes the paper.

II Activation function definition

In this work, we propose an implicit, parametric, non-linear activation function that can adapt its shape during the training process to create a flexible neural network. This function appears as the result of the research carried out by the same authors on the field of optimization of linear under-determined problems, via the ADMM, [14, 15, 16, 17, 18, 19, 20, 21, 22]. Specifically, the linear problem $\textbf{Hu}=\textbf{g}$ can be optimized with an $L_{p}^{q}-$ norm regularization term as follows:

\left.\begin{array}[]{cc}\mbox{minimize }&\frac{1}{2}\sum\limits_{i=1}^{M}\left\|\textbf{H}_{i}\textbf{u}^{i}-\textbf{g}_{i}\right\|_{2}^{2}+\lambda\left\|\textbf{v}\right\|_{p}^{q}\\ \mbox{s.t.}&\textbf{u}^{i}=\textbf{v},\;\;\;\forall i=1,\dots,M\\ &p,q\geq 1\end{array}\right.

(1)

On one hand, $\textbf{H}_{i}$ is each of the $M$ submatrices in which $\textbf{H}\in\mathbb{C}^{N_{m}\times N_{p}}$ is divided by rows, where $N_{m}$ is the number of data and $N_{p}$ the number of unknowns, usually having $N_{p}\gg N_{m}$ . $\textbf{g}_{i}$ is each of the $M$ subvectors in which $\textbf{g}\in\mathbb{C}^{N_{m}}$ is divided; and $\textbf{u}^{i}\in\mathbb{C}^{N_{p}}$ is each of the $M$ independent solutions that can be generated for each division. The optimization process using the ADMM algorithm imposes the agreement of all solutions via the consensus variable $\textbf{v}\in\mathbb{C}^{N_{p}}$ , as depicted in Fig. 1.

Refer to caption — Figure 1: (a) Graphical representation of the process of dividing the matrix equation system by rows. (b) Hierarchical architecture of the consensus-based ADMM, where a central node collects the updates of the $M$ sub-nodes, performs a non-linear averaging, and distributes the solution back to the sub-nodes, to converge on a consensus solution.

On the other hand, the regularization term $\lambda\left\|\textbf{v}\right\|_{p}^{q}$ imposes a certain structure in the particular solution sought for the problem, where $\lambda$ is just a design hyperparameter to weight the importance of this type of structure in the optimization process. The parameter $p$ induces a preferred direction of searching for the solution. In this sense, a norm-1 regularization (lasso) imposes a sparse solution, [22, 23, 24], a norm-2 regularization (ridge regression) seeks for the solution with minimum energy [25, 26, 27], meanwhile a norm-infinity minimizes the magnitude of the greatest component [28, 29, 30]. As a more general expression, the bridge regression regularization term, of the form $\|\textbf{v}\|_{p}^{p}$ , accounts for the previously mentioned lasso for $p=1$ and ridge regression for $p=2$ , and performs a smooth version of the elastic net regularization for $1<p<2$ , [31, 32, 33, 18, 34, 35]. Meanwhile, the parameter $q$ defines the way the distances are measured, distorting the metric space. Particularly, for large values of $p$ , the distance between two different points tends to be very similar, and specifically, $p=\infty$ , leads to the discrete distance.

The optimization problem on (1) is solved by introducing the Lagrangian multipliers or dual variables $\textbf{s}^{i}$ , one for each constrain, and sequentially optimizing $\textbf{u}^{i}$ , v, and $\textbf{s}^{i}$ , in this order. See Ref. [21] for a detailed explanation. The optimization of the primal and dual variables $\textbf{u}^{i}$ and $\textbf{s}^{i}$ are always the same regardless the type of regularization, since this one only affects the definition of the consensus variable v. Its optimization, therefore, comes from the computation of the gradient of the Lagrangian with respect to each component $v_{l}$ :

\frac{\partial L_{\rho}}{\partial v_{l}}=\lambda q\frac{|v_{l}|^{p-1}}{\|\textbf{v}\|_{p}^{p-q}}\frac{\partial}{\partial v_{l}}|v_{l}|+\rho M\left(v_{l}-\left(\bar{u}_{l}+\bar{s}_{l}\right)\right)\ni 0,

(2)

where $\rho$ is the augmented parameter. Notice that the input variable in the expression (2) is $\left(\bar{u}_{l}+\bar{s}_{l}\right)$ , namely, the sum of the average of the $l-th$ components of the variables $\textbf{u}^{i}$ and $\textbf{s}^{i}$ , while the output is, indeed, $v_{l}$ . Consequently, this is, in general, a non explicit expression except for some particular values of $p$ and $q$ , and it has to be solved by iteratively reweighting the expression with the previous iterations of $v_{l}$ .

Definitely, the selection of the values of $p$ and $q$ for the regularization term will determine the shape of this implicit function. Interestingly, the behavior of this function resemble the definition of the common activation functions, such as the ReLU ( $p=1$ ), PReLU ( $1<p<2$ ), linear ( $p=2$ ), sigmoid ( $2<p=q<\infty$ ), or soft clipping ( $p=q=\infty$ ), as depicted in Figure 2, [36, 37]. By redefining $a_{l}=\bar{u}_{l}+\bar{s}_{l}$ as the input in Eqn. (2), dropping the component dependence $l$ , grouping and redefining the constant $\lambda:=\frac{\lambda}{\rho M}>0$ as a tuning hyperparameter, and for the particular case of $1<p=q<\infty$ , we define the implicit, non-linear, parametric activation function as follows:

R(a,v;p)=\lambda p|v|^{p-1}\text{sign}(v)+\left(v-a\right)=0,

(3)

which defines the non-linear mapping

v=f(a;p),

(4)

where $a$ is the input and so-called activation variable, $p$ is the learning parameter of the function, and $v$ is the output of the activation function $f$ . We call the $\bm{p}$ -network to the neural networks designed with this type of activation function. The general case for $p\neq q$ would lead to the denominated $\bm{pq}$ -network, which will extend this work in the future.

III $p-$ Network implementation

The top part of Fig. 3 depicts an example of a fully connected $p-$ network of $m$ layers with $r^{k}$ neurons per layer, for $k=1,\dots,m$ ( $k$ indicates a super index, not a power). Additionally, the layer $0$ is considered the input layer, and does not have neurons properly, but is it just the input data; the layer $m$ is the output layer and all the others are the intermediate or hidden layers. The parameters that define the network include the weights of the synapses that connect the different layers and the internal parameter of each neuron for defining the activation function. The benefit of using a parametric activation function is that the non-linear general expression is the same in all neurons. In this way, as represented at the bottom part of Fig. 3, the activation variable $a_{j}^{k}$ , namely, the input to the activation function of the $j-th$ neuron of layer $k$ , can be expressed as follows:

a_{j}^{k}=\sum\limits_{s=0}^{r^{k-1}}w_{js}^{k}v_{s}^{k-1}=\sum\limits_{s=0}^{r^{k-1}}w_{js}^{k}f\left(a_{s}^{k-1};p_{s}^{k-1}\right),

(5)

where $w_{j0}^{k}=b_{j}^{k}$ is the bias of the $j-th$ neuron of layer $k$ , thus considering $v_{0}^{k-1}=f\left(a_{0}^{k-1};p_{0}^{k-1}\right)=1$ . The output of the neuron is $v_{j}^{k}=f(a_{j}^{k};p_{j}^{k})$ , and notice that $v_{s}^{0}=f\left(a_{s}^{0};p_{s}^{0}\right)=x_{s}$ corresponds to the input data to the network, where $r^{0}$ is its dimensionality, and $v_{s}^{m}=f\left(a_{s}^{m};p_{s}^{m}\right)=\hat{y}_{s}$ is the output data, with dimensionality $r^{m}$ .

III-A Feedforward pass

The first challenge appears in the actual implementation of the activation function, since this is an implicit expression. In order to overcome this, a reweighted iteration process during each feedforward pass is proposed.

•

Method 1: Taking into account that $sign(a)=\frac{a}{|a|}$ , the activation function in (3), where $a$ is the input and $v$ is the output, can be expressed as follows:

$v=\frac{1}{1+\lambda p|v_{\text{prev}}|^{p-2}}a.$ (6)

In this way, a quick iteration can be done to approach the actual value of the output by considering the output of the previous iteration $v_{\text{prev}}$ on the evaluation of the right hand side of equation (6). However, this method has the disadvantage of a potential divergence for values of $p>2$ , when the input is large in magnitude, thus only assuring the convergence for $p<2$ .
•

Method 2: A second approach is proposed since the activation function (3) can be represented in a different way. Taking into account that, for this function, $\text{sign}(a-v)=\text{sign}(v)=\text{sign}(a)$ , the implicit expression can be reformulated as

$v=\left(\frac{1}{\lambda p}|a-v_{prev}|\right)^{\frac{1}{p-1}}\text{sign}(a),$ (7)

in which, again, a quick iteration over the output values can approach the actual output of the activation function. This second method will have convergence problems when $p$ is large for small input values in magnitude, as they will tend to be close to $1$ since $\lim\limits_{x\rightarrow 0}x^{x}=1$ , though the function is expected to have a quasi-linear behavior for input values smaller than $1$ in magnitude.

Since both methods result to be complementary, a solution for implementing the activation function consists on combining these two methods: if $p<2$ , use (6); if $p>2$ , use (6) for small input values and (7) for large input values. For $p=2$ is one of the few particular cases in which the activation function has an explicit expression, as $v=\frac{1}{1+2\lambda}a$ .

The last step for the feedforward evaluation consists on determine the threshold that discriminates the input values between small or large. As shown on the left side of Fig. 2, the activation functions are odd and strictly monotonically increasing functions, and have three clear intervals: one for large negative values, one for values around zero, and another one for large positive values. The points in which the tendency changes depends on the parameter $p$ , and marks the threshold for considering the input values as small or large. For values of $p>2$ , the slope of the functions gets a value close to $1$ for inputs around $0$ , while it gets small values for large inputs in magnitude. Then, the threshold can be selected as this point of change of tendency, selecting the input value for which the derivative of the activation functions is equal to $0.5$ .

From the implicit expression of the activation function in (3), its derivative $\frac{\partial v}{\partial a}$ can be expressed as follows:

v^{\prime}=\frac{\partial v}{\partial a}=-\frac{\partial R/\partial a}{\partial R/\partial v}=\frac{1}{\lambda p(p-1)|v|^{p-2}+1}.

(8)

By setting $v^{\prime}=0.5$ , the output value for the threshold, $v_{\tau}$ , is computed:

|v_{\tau}|=\left(\frac{1}{\lambda p(p-1)}\right)^{\frac{1}{p-2}}

(9)

Because of the properties of the activation function, it can be considered the positive value for the positive input and the negative value for the negative input. The expression in (9) indicates the value of the output of the activation function to achieve a slope of $0.5$ , but it is necessary to compute the value of the input $a_{\tau}$ . To this end, it is easy to extract the input given the output from (3), since

a_{\tau}=\lambda p|v_{\tau}|^{p-1}\text{sign}(v_{\tau})+v_{\tau}=v_{\tau}\left(1+\lambda p|v_{\tau}|^{p-2}\right).

(10)

And by introducing (9) into (10),

a_{\tau}=\left(\frac{1}{\lambda p\left(p-1\right)}\right)^{\frac{1}{p-2}}\left(1+\frac{1}{p-1}\right)

(11)

In summary, for evaluating the implicit activation function employ the following casuistic:

v=\begin{cases}\textit{Method 1, }\eqref{method1}&\text{if }p\leq 2\\ \textit{Method 1, }\eqref{method1}&\text{if }p>2\text{ and }a<a_{\tau}\\ \textit{Method 2, }\eqref{method2}&\text{if }p>2\text{ and }a\geq a_{\tau}\end{cases}

(12)

III-B Optimization of the parameters of the network via Backpropagation

Training a neural network is the process of obtaining the optimal parameters that defines it, commonly the weights and biases, given a set of input-output training data. Backpropagation is the training method that evaluates the error in the last layer and transfers it throughout the network, to the first layer. In the proposed network, besides the weights and biases, the neural network is tuned as well with the parameters $p$ of the activation function in each neuron.

In regards of the general structure of the $p-$ network described in Fig. 3, consider a training dataset $X=\left\{\left(\bm{x}^{(1)},\bm{y}^{(1)}\right),\dots,\left(\bm{x}^{(N)},\bm{y}^{(N)}\right)\right\}$ of size $N$ , where $\bm{x}^{(d)}\in\mathbb{R}^{r_{0}}$ is the input data and $\bm{y}^{(d)}\in\mathbb{R}^{r_{m}}$ is the output data, for all $d=1,\dots,N$ . The mean square error for a specific parameter $\theta$ of the network is

E(X,\theta)=\frac{1}{2N}\sum\limits_{d=1}^{N}\left(\hat{\bm{y}}^{(d)}-\bm{y}^{(d)}\right)^{2},

(13)

where $\hat{\bm{y}}^{(d)}$ is the computed output of the network for the input $\bm{x}^{(d)}$ . Notice that, for each sample data $d$ , the $l-th$ component of the output vector is $\hat{y}_{l}=f(a_{l}^{m};p_{l}^{m})$ , namely, the output of the $l-th$ neuron of the last layer.

The parameter $\theta$ is updated based on the gradient descend:

\theta^{t+1}=\theta^{t}-\alpha\frac{\partial E(X,\theta^{t})}{\partial\theta},

(14)

where $t$ is the iteration step and $\alpha$ is the learning ratio, which can be different for the parameters $p$ , $\alpha_{p}$ , than for the weights and biases, $\alpha_{w}$ .

Since the error can be accumulated for each training pair, the analysis can be done for each pair independently, dropping the dependence $d$ of the training sample.

III-B1 Optimization of the parameter $p$ in the activation functions

Consider a multidimensional regression neural network and start with the optimization of the parameters $p$ of the last layer. The error with respect to the parameter of the activation function of the $j-th$ neuron, $p_{j}^{m}$ is

	$\displaystyle\frac{\partial E}{\partial p_{j}^{m}}=\frac{\partial}{\partial p_{j}^{m}}\frac{1}{2}\\|\hat{\bm{y}}-\bm{y}\\|_{2}^{2}=\frac{\partial}{\partial p_{j}^{m}}\frac{1}{2}\sum\limits_{l=1}^{r^{m}}\left(\hat{y}_{l}-y_{l}\right)^{2}=$		(15)
	$\displaystyle=(\hat{y}_{j}-y_{j})\frac{\partial f(a_{j}^{m};p_{j}^{m})}{\partial p_{j}^{m}}=\left(f(a_{j}^{m};p_{j}^{m})-y_{j}\right)\frac{\partial f(a_{j}^{m};p_{j}^{m})}{\partial p_{j}^{m}}.$		(15)

The partial derivative of the activation function with respect to the parameter is required to complete the expression. Similar as shown in Eqn. (8), and remembering the implicit expression of the activation function in (3), this partial derivative can be computed as follows:

\displaystyle\frac{\partial f(a;p)}{\partial p}=\frac{\partial v}{\partial p}=-\frac{\partial R/\partial p}{\partial R/\partial v}=-\frac{\lambda|v|^{p-1}\text{sign}(v)\left(1+p\text{Ln}|v|\right)}{\lambda p(p-1)|v|^{p-2}+1}.

(16)

Therefore,

\displaystyle\frac{\partial E}{\partial p_{j}^{m}}=-(\hat{y}_{j}-y_{j})\frac{\lambda|\hat{y}_{j}|^{p_{j}^{m}-1}\text{sign}(\hat{y}_{j})\left(1+p_{j}^{m}\text{Ln}|\hat{y}_{j}|\right)}{\lambda p_{j}^{m}(p_{j}^{m}-1)|\hat{y}_{j}|^{p_{j}^{m}-2}+1}.

(17)

For the intermediate layers, the chain rule can be applied. The partial derivative of the error with respect to the parameter $p$ of the $j-th$ node of the $k-th$ layer can be expressed as follows:

\frac{\partial E}{\partial p_{j}^{k}}=\sum\limits_{h=1}^{r^{k+1}}\frac{\partial E}{\partial a_{h}^{k+1}}\frac{\partial a_{h}^{k+1}}{\partial p_{j}^{k}},

(18)

since this parameter affects the $r^{k+1}$ nodes of layer $k+1$ , where $a_{h}^{k+1}$ is the activation variable of the node $h$ of layer $k+1$ . Similarly, and making use of Eqn. (5),

\frac{\partial E}{\partial a_{j}^{k}}=\sum\limits_{l=1}^{r^{k+1}}\frac{\partial E}{\partial a_{l}^{k+1}}\frac{\partial a_{l}^{k+1}}{\partial a_{j}^{k}}=\sum\limits_{l=1}^{r^{k+1}}\frac{\partial E}{\partial a_{l}^{k+1}}w_{lj}^{k+1}\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial a_{j}^{k}}.

(19)

By defining the error term as

\delta_{j}^{k}=\frac{\partial E}{\partial a_{j}^{k}},

(20)

and introducing it into (19),

\delta_{j}^{k}=\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial a_{j}^{k}}\sum\limits_{l=1}^{k+1}w_{lj}^{k+1}\delta_{l}^{k+1}.

(21)

Knowing that

\frac{\partial a_{h}^{k+1}}{\partial p_{j}^{k}}=w_{hj}^{k+1}\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial p_{j}^{k}},

(22)

by introducing (20)-(22) into (18), the final expression is

\frac{\partial E}{\partial p_{j}^{k}}=\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial p_{j}^{k}}\sum\limits_{h=1}^{r^{k+1}}w_{hj}^{k+1}\delta_{h}^{k+1},

(23)

and defining, for the last layer,

\delta_{j}^{m}=(\hat{y}_{j}-y_{j})\frac{\partial f(a_{j}^{m};p_{j}^{m})}{\partial a_{j}^{m}}.

(24)

The computation of the expression $\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial p_{j}^{k}}$ is defined in (16) and, on the same way,

\displaystyle\frac{\partial f(a;p)}{\partial a}=\frac{\partial v}{\partial a}=-\frac{\partial R/\partial a}{\partial R/\partial v}=\frac{1}{\lambda p(p-1)|v|^{p-2}+1}.

(25)

III-B2 Optimization of the weights and biases

The optimization of the weights and biases is done as in any regular neural network. Considering the weight $w_{ji}^{k}$ that goes from the neuron $i$ of layer $k-1$ to the neuron $j$ of layer $k$ , and taking into account that the bias of neuron $j$ of layer $k$ is $w_{j0}^{k}$ , the partial derivative of the error with respect to the weight $w_{ji}^{k}$ can be represented as follows:

\frac{\partial E}{\partial w_{ji}^{k}}=\frac{\partial E}{\partial a_{j}^{k}}\frac{\partial a_{j}^{k}}{\partial w_{ji}^{k}}.

(26)

Using the expression in (5),

\frac{\partial a_{j}^{k}}{\partial w_{ji}^{k}}=f(a_{i}^{k-1};p_{i}^{k-1})=v_{i}^{k-1},

(27)

and together with the definition of the error term in (20), the partial derivative of the error with respect to the weights gets a simple expression:

\frac{\partial E}{\partial w_{ji}^{k}}=\delta_{j}^{k}v_{i}^{k-1},

(28)

where $v_{0}^{k-1}=1$ and $v_{i}^{0}=x_{i}$ , namely, the input data.

III-B3 Modification for a classification network

In case of using the network for classification, the last layer should be modified to a regular softmax function. The output of the $j-th$ neuron of the last layer would be

\hat{y}_{j}=\text{softmax}(a_{j}^{m})=\frac{e^{a_{j}^{m}}}{\sum\limits_{s=1}^{r^{m}}e^{a_{s}^{m}}}.

(29)

This last layer does not have parameters $p$ to optimize, thus its presence will only affect the optimization of the weights and biases. The gradient of the error with respect to $w_{ji}^{m}$ is

\frac{\partial E}{\partial w_{ji}^{m}}=\delta_{j}^{m}v_{i}^{m-1},

(30)

where

		$\displaystyle\delta_{j}^{m}=\frac{\partial E}{\partial a_{j}^{m}}=\left(\text{softmax}(a_{j}^{m})-y_{j}\right)\frac{\partial\text{softmax}(a_{j}^{m})}{\partial a_{j}^{m}}=$		(31)
		$\displaystyle=\left(\text{softmax}(a_{j}^{m})-y_{j}\right)\text{softmax}(a_{j}^{m})(1-\text{softmax}(a_{j}^{m}))=$
		$\displaystyle=\left(\hat{y}_{j}-y_{j}\right)\hat{y}_{j}\left(1-\hat{y}_{j}\right).$

The rest of parameters are optimized as in Sect. III-B.

III-B4 Summary of expressions of gradients

In summary, the final expressions for the optimization of the parameters of the network are the following:

•

Gradient of the error with respect to $p$ :

\frac{\partial E}{\partial p_{j}^{k}}=\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial p_{j}^{k}}\sum\limits_{h=1}^{r^{k+1}}w_{hj}^{k+1}\delta_{h}^{k+1},

(32)

for $k=1,\dots,m$ for regression, or for $k=1,\dots,m-1$ for classification.

•

Gradient of the error with respect to the weights and biases:

$\frac{\partial E}{\partial w_{ji}^{k}}=\delta_{j}^{k}v_{i}^{k-1},$ (33)

for $k=1,\dots,m$ . Where the error term is defined as

$\delta_{j}^{k}=\frac{\partial f(a_{j}^{k};p_{j}^{k})}{\partial a_{j}^{k}}\sum\limits_{l=1}^{k+1}w_{lj}^{k+1}\delta_{l}^{k+1},$ (34)

and, for the last layer,

$\delta_{j}^{m}=\left(\hat{y}_{j}-y_{j}\right)\frac{\partial f(a_{j}^{m};p_{j}^{m})}{\partial a_{j}^{m}}.$ (35)

for regression, and

$\displaystyle\delta_{j}^{m}=\left(\hat{y}_{j}-y_{j}\right)\hat{y}_{j}\left(1-\hat{y}_{j}\right)$ (36)

for classification.

The expressions for $\frac{\partial f(a;p)}{\partial p}$ and $\frac{\partial f(a;p)}{\partial a}$ are defined in (16) and (25), respectively, where $v_{i}^{k}=f(a_{i}^{k};p_{i}^{k})$ , $v_{0}^{k}=1$ and $v_{i}^{0}=x_{i}$ .

IV Numerical Results

The network has been implemented in Matlab and tested to validate its correct performance and to see how the activation functions adapt to the given data. The training is performed until reaching a maximum of number of iterations or when the training error goes below maximum error threshold.

IV-A Example #1: Evolution of the parameters

The first example simply tries to visualize the evolution of the parameters $p$ of each neuron during the training process. A simple 3-layers network with $5$ , $3$ , and $1$ neurons, respectively, is defined to match the non-linear function $y=sign(x)$ . The training data set is $N=100$ values taken from a uniform random distribution in the range [-5,5]. The configuration parameters are shown in Table I and the training results in Fig. 4. It can be seen how the parameter $p$ of each neuron evolves to adapt the shape of the activation function in order to better match the training data. Figure 5 illustrates the performance of the trained network. It performs as expected, although is shows some error since the output should be closer to $\pm 1$ , due to the reduced size of the training dataset.

TABLE I: Configuration parameter of

p-

network for example #1

Parameter	Value
$\lambda$	$1$
Initial $p$	2
Initial $w$	random $\sim\mathcal{N}(0,1)$
Max # of iterations	1000
Max error	$10^{-3}$
Iterations for the activation function	100
$\alpha_{p}$	100
$\alpha_{w}$	0.1

IV-B Example #2: Comparison with a regular feedforward network

The second example tries to compare the performance of the proposed $p-$ network with a regular feedforward network for regression. To this end, a 4-layers neural network with $10$ , $5$ , $3$ , and $1$ neurons per layer, respectively, is trained to implement (a) the function $y=x^{2}$ , and (b) the function $y=|x|$ . To make the comparison as fair as possible, the activation functions of the feedforward network are selected to be saturation linear and symmetric, (’satlins’) for the hidden layers, since they have the same shape as the proposed activation functions for $p=\infty$ ; and purelin, (’linear’) for the output layer, as a linear neuron is recommended for the last layer of regression networks, and it has the same shape as the proposed activation function for $p=2$ , and also $\lambda=0$ to have the same slope. Likewise, the learning ratios of the weights and biases are set the same. For numerical reasons, we set the initial value of $p$ to $100$ for the hidden layers, and the value of $\lambda$ to $10^{-10}$ . The configuration parameters of both networks are shown in Tables II and III, respectively. The training data set is $N=100$ values taken from a uniform random distribution in the range [-1,1]. Two types of training are done for the $p-$ network: (i) fixing the value of $p$ of each neuron, namely, $\alpha_{p}=0$ and the activation functions get fixed, and (ii), allowing the adaptation of the activation functions. The training progress and the performance results are shown in Figs. 6 and 7, while Table IV shows the error computed as defined in (13).

TABLE II: Configuration parameter of the

p-

network for example #2

Initial $p$	hidden layers: 100 output layer: 2
Parameter	Value
$\lambda$	$10^{-10}$
Initial $w$	random $\sim\mathcal{N}(0,1)$
Max # of iterations	1000
Max error	$10^{-4}$
Iterations for the activation function	10
$\alpha_{p}$	(i) 0, (ii) $10^{4}$
$\alpha_{w}$	0.01

TABLE III: Configuration parameter of the feedforward network for example #2

Activation functions	hidden layers: ’satlins’ output layer: ’linear’
Parameter	Value
Initial $w$	NGuyen-Widrow
Max # of iterations	1000
Max gradient	$10^{-4}$
$\alpha_{w}$	0.01

TABLE IV: Regression errors of example #2

Network	Error case $y=x^{2}$	Error case $y=\|x\|$
Feedforward	$2.3\cdot 10^{-3}$	$2.7\cdot 10^{-3}$
$p-$ network, $\alpha_{p}=0$	$3.2\cdot 10^{-3}$	$4.8\cdot 10^{-3}$
$p-$ network, $\alpha_{p}=10^{4}$	$7.2\cdot 10^{-4}$	$6.4\cdot 10^{-4}$

It can be seen how the adaptative $p-$ network reduces the error with respect to a equivalent regular network. It is also interesting to see how the final values of $p$ for each neuron greatly vary even within the same layer. Table V shows the final values of case (a) after the training is complete. Neuron #1 of layer 3 generates an activation function similar to a saturation linear and symmetric, while neuron #2 of the same layer generates an activation function similar to a ReLU. Notice that, since $p>1$ , the minimum allowed value of $p$ is $1.01$ .

TABLE V: Final values of

p

for example #2(

a

)

Parameter $p$	Layer 1	Layer 2	Layer 3	Layer 4
Neuron #1	139.95	395.13	326.07	2.00
Neuron #2	93.16	17.06	1.01
Neuron #3	39.22	43.27	28.00
Neuron #4	100.00	120.19
Neuron #5	100.00	165.76
Neuron #6	25.06
Neuron #7	100.00
Neuron #8	30.45
Neuron #9	190.52
Neuron #10	48.08

IV-C Example #3: Classification Application

Matlab contains a dataset of $24,075$ examples for describing $5$ types of human activities based on $60$ features. The type of activities are Sitting, Standing, Walking, Running, and Dancing, while the features include mean acceleration, root mean square body acceleration in $x$ , $y$ , and $z$ , among others.

The $p-$ network and a regular feedforward network are designed with 3 layers containing $30$ , $15$ , and $5$ neuron each, respectively. The last layer is configured as a softmax layer for classification. The configuration parameters for the $p-$ network and the feedforward network are shown in Tables VI and VII, respectively. The $p-$ network is trained with (i) fixed activation functions, $\alpha_{p}=0$ , and (ii) allowing the adaptation of the activation functions, $\alpha_{p}>0$ , to compare the results.

Out of the total dataset, randomly select $500$ data per class, for a total of $N=2,500$ training cases, and $500$ randomly selected cases from the remaining set for testing. This selection is done five times, computing the training and testing classification errors for each case.

TABLE VI: Configuration parameter of the

p-

network for example #3

Initial $p$	hidden layers: 5 output layer: ’softmax’
Parameter	Value
$\lambda$	$1$
Initial $w$	random $\sim\mathcal{N}(0,1)$
Max # of iterations	1000
Max error	$10^{-4}$
Iterations for the activation function	10
$\alpha_{p}$	(i) 0, (ii) $10^{-1}$
$\alpha_{w}$	0.1

TABLE VII: Configuration parameter of the feedforward network for example #3

Activation functions	hidden layers: ’tansig’ output layer: ’softmax’
Parameter	Value
Initial $w$	NGuyen-Widrow
Max # of iterations	1000
Max gradient	$10^{-4}$
$\alpha_{w}$	0.1

Table VIII and Fig. 8 show the results of the mean and standard deviation of the training and classification errors. Although the feedforward network provides a smaller training error, the adaptative $p-$ network is able to achieve a smaller testing error. However, non-adaptative $p-$ network gets the worse results.

TABLE VIII: Errors for human activity classification of example #3

Network	Training Error (std)	Testing Error (std)
$p-$ network, $\alpha_{p}=0$	$5.74\%$ $(0.40)$	$4.36\%$ $(0.62)$
$p-$ network, $\alpha_{p}=10^{-1}$	$5.46\%$ $(0.48)$	$3.80\%$ $(0.49)$
Feedforward	$4.99\%$ $(0.21)$	$4.12\%$ $(0.52)$

V Conclusion

This paper has proposed a new structure for the design of neural networks by defining a parametric non-linear activation function that can adapt its shape during the training process. This allows to configure the whole network with one single activation function definition, not having to predefine its expression for each layer, and also allowing different shapes for the neurons within the same layer after training. This increases the space of parameters for optimization, but provides more flexibility and generalization on the architecture of neural networks.

The proposed activation functions comes as the expression of the consensus variable when optimizing a linear problem with an $L_{p}^{q}-$ norm regularization term. This expression is, in general, implicit, meaning that there is no an explicit expression of the output variable given the input value. Although this is a challenge for its use, a detailed analysis is provided for its evaluation by the use of a reweighted iteration process. In this paper, the particular case of $1<p=q<\infty$ is implemented. The evaluation of the feedforward pass of the network with this activation function, as well as a detailed explanation of the training process via the backpropagation method for the optimization of the parameter $p$ and the weigths and biases is also provided for completeness.

Preliminary results for both regression and classification show that the proposed $p-$ network reduces the error of the testing sets when comparing with an equivalent regular neural network with the same number of layers and neuron, and with predefined activation functions that are as similar as possible as the $p-$ network activation function with the initial values of $p$ at the beginning of the training process.

Future work will implement the $pq-$ networks, that is, allowing the distinction of the parameters $p$ and $q$ during the training process to account for a larger set of possible shapes of the activation function. The application of this methodology to more complex networks, such as Deep, Convolutional, or Recurrent Neural Networks will open the way to the definition of more general and flexible neural networks. It is expected, consequently, a boost in the performance of these neural networks in terms of error reduction when using the proposed adaptive activation functions.

ACKNOWLEDGEMENT

This work has been funded by the Department of Energy (Award # DE-SC0017614).

References

[1] S. C. KR et al., “Real time object identification using deep convolutional neural networks,” in 2017 International Conference on Communication and Signal Processing (ICCSP). IEEE, 2017, pp. 1801–1805.
[2] S. Akcay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon, “Using deep convolutional neural network architectures for object classification and detection within x-ray baggage security imagery,” IEEE transactions on information forensics and security, vol. 13, no. 9, pp. 2203–2215, 2018.
[3] A. Agarwal, S. Kumar, and D. Singh, “Development of neural network based adaptive change detection technique for land terrain monitoring with satellite and drone images,” Defence Science Journal, vol. 69, no. 5, p. 474, 2019.
[4] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE transactions on neural networks, vol. 8, no. 1, pp. 98–113, 1997.
[5] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 8599–8603.
[6] F. Amato, A. López, E. M. Peña-Méndez, P. Vaňhara, A. Hampl, and J. Havel, “Artificial neural networks in medical diagnosis,” pp. 47–58, 2013.
[7] J. A. Weyn, D. R. Durran, R. Caruana, and N. Cresswell-Clay, “Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models,” Journal of Advances in Modeling Earth Systems, vol. 13, no. 7, p. e2021MS002502, 2021.
[8] J. Shen and M. O. Shafiq, “Short-term stock market price trend prediction using a comprehensive deep learning system,” Journal of big Data, vol. 7, no. 1, pp. 1–33, 2020.
[9] Z. Zhang, “Artificial neural network,” in Multivariate time series analysis in climate and environmental research. Springer, 2018, pp. 1–35.
[10] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
[11] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” in International conference on machine learning. PMLR, 2013, pp. 1319–1327.
[12] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
[13] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 530–546.
[14] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2009.
[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, July 2011.
[16] J. Heredia-Juesas, L. Tirado, A. Molaei, and J. A. Martinez-Lorenzo, “Admm based consensus and sectioning norm-1 regularized algorithm for imaging with a cra,” in 2019 IEEE International Symposium on Antennas and Propagation and USNC-URSI Radio Science Meeting. IEEE, 2019, pp. 549–550.
[17] J. Heredia Juesas, G. Allan, A. Molaei, L. Tirado, W. Blackwell, and J. A. M. Lorenzo, “Consensus-based imaging using admm for a compressive reflector antenna,” in 2015 IEEE International Symposium on Antennas and Propagation & USNC/URSI National Radio Science Meeting. IEEE, 2015, pp. 1304–1305.
[18] J. Heredia-Juesas, A. Molaei, L. Tirado, and J. A. Martinez-Lorenzo, “Fast node communication admm-based imaging algorithm with a compressive reflector antenna,” in 2018 IEEE International Symposium on Antennas and Propagation & USNC/URSI National Radio Science Meeting. IEEE, 2018, pp. 535–536.
[19] J. Heredia-Juesas, L. Tirado, and J. A. Martinez-Lorenzo, “Fast source reconstruction via admm with elastic net regularization,” in 2018 IEEE International Symposium on Antennas and Propagation & USNC/URSI National Radio Science Meeting. IEEE, 2018, pp. 539–540.
[20] J. Heredia-Juesas, A. Molaei, L. Tirado, and J. A. Martinez-Lorenzo, “Sectioning-based admm imaging for fast node communication with a compressive antenna,” IEEE Antennas and Wireless Propagation Letters, vol. 18, no. 2, pp. 226–230, 2018.
[21] ——, “Consensus and sectioning-based admm with norm-1 regularization for imaging with a compressive reflector antenna,” IEEE Transactions on Computational Imaging, vol. 7, pp. 1189–1204, 2021.
[22] J. Heredia-Juesas, A. Molaei, L. Tirado, W. Blackwell, and J. A. Martinez-Lorenzo, “Norm-1 regularized consensus-based admm for imaging with a compressive antenna,” IEEE Antennas and Wireless Propagation Letters, vol. 16, pp. 2362–2365, 2017.
[23] M. R. Osborne, B. Presnell, and B. A. Turlach, “On the lasso and its dual,” Journal of Computational and Graphical statistics, vol. 9, no. 2, pp. 319–337, 2000.
[24] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[25] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[26] H.-P. Piepho, “Ridge regression and extensions for genomewide selection in maize,” Crop Science, vol. 49, no. 4, pp. 1165–1176, 2009.
[27] Z. Zhang, G. Dai, C. Xu, and M. I. Jordan, “Regularized discriminant analysis, ridge regression and beyond,” Journal of Machine Learning Research, vol. 11, no. Aug, pp. 2199–2228, 2010.
[28] S. Shahabuddin, M. Juntti, and C. Studer, “Admm-based infinity norm detection for large mu-mimo: Algorithm and vlsi architecture,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2017, pp. 1–4.
[29] J. Shen, H. Xu, and P. Li, “Online optimization for max-norm regularization,” in Advances in Neural Information Processing Systems, 2014, pp. 1718–1726.
[30] I. Gravagne and I. D. Walker, “Properties of minimum infinity-norm optimization applied to kinematically redundant robots,” in Proceedings. 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems. Innovations in Theory, Practice and Applications (Cat. No. 98CH36190), vol. 1. IEEE, 1998, pp. 152–160.
[31] J. Huang, S. Ma, H. Xie, and C.-H. Zhang, “A group bridge approach for variable selection,” Biometrika, vol. 96, no. 2, pp. 339–355, 2009.
[32] C. Park and Y. J. Yoon, “Bridge regression: adaptivity and group selection,” Journal of Statistical Planning and Inference, vol. 141, no. 11, pp. 3506–3519, 2011.
[33] W. J. Fu, “Penalized regressions: the bridge versus the lasso,” Journal of computational and graphical statistics, vol. 7, no. 3, pp. 397–416, 1998.
[34] C. De Mol, E. De Vito, and L. Rosasco, “Elastic-net regularization in learning theory,” Journal of Complexity, vol. 25, no. 2, pp. 201–230, 2009.
[35] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the royal statistical society: series B (statistical methodology), vol. 67, no. 2, pp. 301–320, 2005.
[36] B. Karlik and A. V. Olgac, “Performance analysis of various activation functions in generalized mlp architectures of neural networks,” International Journal of Artificial Intelligence and Expert Systems, vol. 1, no. 4, pp. 111–122, 2011.
[37] P. Sibi, S. A. Jones, and P. Siddarth, “Analysis of different activation functions using back propagation neural networks,” Journal of Theoretical and Applied Information Technology, vol. 47, no. 3, pp. 1264–1268, 2013.

Consensus Function from an Lpq−L_{p}^{q}-norm Regularization Term for its Use as Adaptive Activation Functions in Neural Networks