
Cascaded Compressed Sensing Networks: A Reversible Architecture for Layerwise Learning

Weizhi Lu, Mingrui Chen, Kai Guo, and Weiyu Li
W. Lu, M. Chen, and K. Guo are with the School of Control Science and Engineering, Shandong University, Jinan, China. E-mail: [email protected]. W. Li is with the Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China. E-mail: [email protected]
Abstract

Recently, methods that learn networks layer by layer have attracted increasing interest for their ease of analysis. For such methods, the main challenge lies in deriving an optimization target for each layer by inversely propagating the global target of the network. This propagation problem is ill posed, because it involves inverting nonlinear activations from low-dimensional to high-dimensional spaces. The existing solution is to learn an auxiliary network dedicated to propagating the target. However, such a network lacks stability and, moreover, increases the complexity of network learning. In this letter, we show that target propagation can be achieved by modeling each layer of the network with compressed sensing, without the need for auxiliary networks. Experiments show that the proposed method achieves better performance than the auxiliary network-based method.

Index Terms:
Deep neural networks, reversible networks, layerwise learning, compressed sensing

I Introduction

The error backpropagation (BP) method has achieved great success in the supervised learning of deep neural networks [1]. Because it optimizes the entire network as a whole, however, the method can hardly disclose the contribution of each layer, which makes the network difficult to analyze. As an alternative to BP, the emerging layerwise learning method [2, 3] seems more suitable for network analysis. The method learns the network layer by layer and thus reduces network analysis to the layer level.

For layerwise learning, the key is to derive an optimization target for each layer, called the local target hereafter. Currently, the simplest method is to directly employ the global target [4], namely the optimization target of the entire network, as the target of each layer. However, the global target is not guaranteed to be optimal for each layer in terms of minimizing the global loss. To achieve BP-level performance, the methods in [5, 6] append a classifier to the layer of interest and then optimize them together with the global target. This forms a shallow network with at least one hidden layer, in which the layer of interest is still hidden and thus hard to analyze [7, 8, 9]. To directly optimize and analyze each layer, we resort to another emerging layerwise learning method known as target propagation (TP) [2, 3, 10], which derives the local target of each layer by inversely propagating the global target and optimizes the layer by matching its output to the target. In this method, the local targets are required to satisfy two properties. First, their feedforward outputs should be identical or close to the global target, which ensures that the optimization of each layer reduces the global loss to the maximum extent. Second, the local target of each layer should be close to the forward output of that layer. This helps to reduce the variation of each layer over optimization, and low variations tend to improve the network's generalization [11, 12, 13]. However, the two properties are hard to achieve because the inverse propagation of the global target involves inverting nonlinear activations from low-dimensional to high-dimensional spaces, which is an ill-posed problem. The existing solution is to learn an auxiliary network dedicated to propagating the target. Obviously, this increases the complexity of network learning, and moreover, the auxiliary network provides no guarantee of stable and accurate target propagation. In this letter, we show that auxiliary networks are not necessary for layerwise learning, and that target propagation can be achieved with guaranteed accuracy by modeling each layer of the network with compressed sensing.

In compressed sensing [14, 15], it has been proved that a high-dimensional signal $x\in\mathbb{R}^{n}$ with a sparse transform $s=Dx$ over an orthonormal dictionary $D\in\mathbb{R}^{n\times n}$ can be recovered from its low-dimensional projection $y=WDx=Ws$, where $W\in\mathbb{R}^{m\times n}$ and $m\ll n$. The recovery is implemented by first recovering the sparse vector $s$ from $y$ via an $\ell_{0}$- or $\ell_{1}$-regularized least squares problem [16], and then simply computing $x=D^{\top}s$. Inspired by this reverse process, we propose to model each network layer as a compressed sensing process. Specifically, we formulate each layer as a cascade of $W$ and $D$, such that the input $s$ of $W$ can be recovered from its output $y$ via sparse recovery, and the input of $D$ (namely the output $y$ of $W$) can be simply recovered from its output $s$ by multiplying with $D^{\top}$. Repeating this process layer by layer, target propagation is finally accomplished. To obtain better performance, as usual, we further deploy a nonlinear function $f(\cdot)$ after each dictionary $D$ in order to remove activations of small magnitude. This operation brings two benefits. First, it further improves the sparsity of the output activations of $D$, which benefits the sparse recovery over the following $W$. Second, it tends to reduce within-class scattering and thus improves the classification accuracy [17]. Overall, with the combination of $W$, $D$ and $f(\cdot)$, we establish a compressed sensing framework for each layer, and obtain a cascaded compressed sensing architecture over all layers. To the best of our knowledge, this is the first time that compressed sensing has been introduced into layerwise learning to handle the target propagation problem.
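To make this reverse process concrete, the following NumPy sketch builds a toy instance of it: a signal with a sparse code is projected to low dimension, the code is recovered with a small orthogonal matching pursuit (OMP) routine, and the signal is obtained by applying the adjoint of the dictionary. The routine and the problem sizes are illustrative only, not the implementation used in the experiments.

```python
import numpy as np

def omp(W, y, k):
    """Minimal orthogonal matching pursuit: find a k-sparse s with y ≈ W s."""
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))      # most correlated column
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], y, rcond=None)
        residual = y - W[:, support] @ coef             # update the residual
    s = np.zeros(W.shape[1])
    s[support] = coef
    return s

# Toy demonstration of the reverse process (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n, m, k = 256, 64, 8
D, _ = np.linalg.qr(rng.standard_normal((n, n)))        # orthonormal dictionary D
W = rng.standard_normal((m, n)) / np.sqrt(m)            # Gaussian projection W, m << n
s = np.zeros(n)
s[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse code s
x = D.T @ s                                             # signal whose transform is s = D x
y = W @ D @ x                                           # low-dimensional projection y = W D x = W s
x_hat = D.T @ omp(W, y, k)                              # recover s from y, then x = D^T s
print("recovery error:", np.linalg.norm(x - x_hat))
```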

Compared to the auxiliary network-based TP method, the proposed compressed sensing-based TP method has two major advantages. First, it avoids the use of auxiliary networks and thus reduces the complexity of network learning. Second, it can achieve target propagation with guaranteed accuracy. In layerwise learning, the number of local targets that must be computed and stored equals the number of activations in the layer. Since convolutional networks contain huge numbers of activations and thus pose great computational challenges to layerwise learning, we only test the proposed layerwise learning method on fully connected networks, as in [3, 10], for tractable computation. Our performance advantage is verified by classification experiments on MNIST and CIFAR10.

II Method

Figure 1: The proposed depth-$L$ network structure, with feedforward activations marked on the upper part and feedback targets marked on the lower part.

Let us consider a deep fully connected network with $L$ layers as shown in Fig. 1. Each layer consists of two sublayers $\{W_{l},D_{l}\}$, where $W_{l}\in\mathbb{R}^{m_{l}\times n_{l}}$ with $m_{l}\ll n_{l}$ denotes a projection matrix and $D_{l}\in\mathbb{R}^{m_{l}\times m_{l}}$ denotes an orthonormal dictionary, $D_{l}^{\top}D_{l}=I$. Following $D_{l}$ is a pointwise thresholding function $f_{l}(\cdot)$ which zeroes out all but the $k_{l}$ ($\ll m_{l}$) largest positive elements, acting similarly to the ReLU function. Note that the output layer is a classifier which comprises $W_{L}$ but no $D_{L}$. Suppose $h_{W}^{l}\in\mathbb{R}^{m_{l}}$ and $h_{D}^{l}\in\mathbb{R}^{m_{l}}$ are the output activations of $W_{l}$ and $D_{l}$, respectively. The feedforward network structure is formulated by

h_{W}^{l} = W_{l} h_{D}^{l-1} \quad\text{and}\quad h_{D}^{l} = f_{l}(D_{l} h_{W}^{l}), \quad l = 1,\ldots,L-1 \qquad (1)
h_{W}^{l} = g(W_{l} h_{D}^{l-1}), \quad l = L \qquad (2)

where $g(\cdot)$ is the softmax function, and the notations $h_{D}^{0}$ and $h_{W}^{L}$ denote the network's input and output, respectively.
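For concreteness, a minimal NumPy sketch of the feedforward pass (1)-(2) is given below; the top-$k$ positive thresholding implements $f_{l}(\cdot)$, and the list conventions (Ws holds $W_{1},\ldots,W_{L}$, Ds holds $D_{1},\ldots,D_{L-1}$, ks holds $k_{1},\ldots,k_{L-1}$) are illustrative assumptions.

```python
import numpy as np

def topk_positive(v, k):
    """f_l: keep the k largest positive entries of v and zero out everything else."""
    out = np.zeros_like(v)
    idx = np.argsort(v)[-k:]          # indices of the k largest entries
    keep = idx[v[idx] > 0]            # among them, keep only the positive ones
    out[keep] = v[keep]
    return out

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(h0, Ws, Ds, ks):
    """Feedforward pass of (1)-(2) for a single input vector h0 = h_D^0."""
    hW, hD = [], [h0]
    for W, D, k in zip(Ws[:-1], Ds, ks):
        hW.append(W @ hD[-1])                         # (1): h_W^l = W_l h_D^{l-1}
        hD.append(topk_positive(D @ hW[-1], k))       # (1): h_D^l = f_l(D_l h_W^l)
    hW.append(softmax(Ws[-1] @ hD[-1]))               # (2): h_W^L = g(W_L h_D^{L-1})
    return hW, hD
```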

As sketched in Algorithm 1, the proposed layerwise learning is implemented in three phases. First, we initialize the network by feeding forward the network input $h_{D}^{0}$. During this process, the projection matrices $W_{l}$ are randomly generated, and the orthonormal dictionaries $D_{l}$ are learned from their input activations $h_{W}^{l}$. Second, as shown in the lower part of Fig. 1, the global target $\tilde{h}_{W}^{L}$ is inversely propagated layer by layer via compressed sensing, producing local targets $\tilde{h}_{W}^{l}$ and $\tilde{h}_{D}^{l}$ for $W_{l}$ and $D_{l}$, respectively, $1\leq l<L$. Finally, we feed the input $h_{D}^{0}$ forward again, in order to successively minimize the local losses $\mathcal{L}_{l}(h_{W}^{l},\tilde{h}_{W}^{l})$ and $\mathcal{L}_{l}(h_{D}^{l},\tilde{h}_{D}^{l})$ of each layer by updating $W_{l}$ and $D_{l}$, $1\leq l<L$, which finally minimizes the global loss $\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})$. In this letter, we simply define the local losses with Euclidean distances and the global loss with cross entropy. The three phases are detailed in the following.

Algorithm 1 The proposed layerwise learning
1:  Phase 1: Activation forward propagation.
2:  for $l=1$ to $L$ do
3:     Generate $h_{W}^{l}$ and $h_{D}^{l}$ by (1) and (2).
4:     Initialize $W_{l}$ with Gaussian distributions.
5:     Learn $D_{l}$ with its input $h_{W}^{l}$ by Algorithm 2.
6:  end for
7:  Further update the final $W_{L}$ by (3).
8:  Phase 2: Target backward propagation.
9:  Generate $\tilde{h}_{D}^{L-1}$ by (6) and $\tilde{h}_{W}^{L-1}$ by (5).
10:  for $l=L-1$ down to $2$ do
11:     Generate $\tilde{h}_{D}^{l-1}$ by (4) and $\tilde{h}_{W}^{l-1}$ by (5).
12:  end for
13:  Phase 3: Forward layerwise learning.
14:  for $l=1$ to $L-1$ do
15:     Update $W_{l}$ by (7) and $D_{l}$ by (9) and (10).
16:  end for
17:  Update the final $W_{L}$ by (12).

II-A Activation forward propagation

By (1) and (2), the network takes $h_{D}^{0}$ as input and $h_{W}^{L}$ as output. During the feedforward propagation, we initialize the projection matrix $W_{l}$ by drawing its elements from the standard normal distribution, which has been proved nearly optimal for compressed sensing [18]. To keep the $\ell_{2}$ norm of the input activation $h_{D}^{l-1}$ unchanged after projection, $W_{l}$ is scaled by a factor of $1/\sqrt{m_{l}}$ [19]. For the output layer $W_{L}$, besides the random initialization, we further update it by

W_{L} = W_{L} - \alpha\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial W_{L}}, \qquad (3)

in order to reduce the difference between the final output $h_{W}^{L}$ and the global target $\tilde{h}_{W}^{L}$. As detailed later, this reduced difference benefits the final layerwise optimization. Empirically, the step size $\alpha$ in (3) should be small to avoid overfitting.
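For the softmax output (2) with a cross-entropy loss, the gradient in (3) has a simple closed form; the sketch below writes one full-batch step with activations stored column-wise. The batch formulation is an illustrative assumption (the experiments use SGD with momentum instead).

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def update_output_layer(WL, HD, Y, alpha):
    """One gradient step of (3) on the cross-entropy loss.

    Columns of HD are penultimate activations h_D^{L-1}; columns of Y are one-hot
    global targets h̃_W^L. For softmax + cross entropy, dL/dW_L = (P - Y) HD^T.
    """
    P = softmax_cols(WL @ HD)                 # h_W^L = g(W_L h_D^{L-1}) for the batch
    grad = (P - Y) @ HD.T / HD.shape[1]       # gradient averaged over the batch
    return WL - alpha * grad
```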

Next, we move to learning the orthonormal dictionary $D_{l}$, which has a dense input $h_{W}^{l}$ and a sparse output $h_{D}^{l}$. Such dictionaries can be learned with a procrustean approach [20], which is given in Algorithm 2 for completeness. In the algorithm, there is a crucial parameter $d$, which counts the number of largest elements retained by the thresholding function $T(\cdot)$. To obtain sparse transforms, we should adopt a relatively small $d$. Note that Algorithm 2 supports online learning, a property that is necessary when the data are processed in batches. From the viewpoint of classification, the sparse transform matrix $D_{l}$ is mainly used for feature selection, while the projection matrix $W_{l}$ serves not only for feature selection but also for dimensionality reduction. Moreover, it is noteworthy that the order of $D_{l}$ and $W_{l}$ could be reversed in the cascaded compressed sensing framework. In this letter, we put $W_{l}$ first in order to decrease the dimension of the network input $h_{D}^{0}$ and thus reduce the optimization complexity.

Finally, let us consider the initialization of the nonlinear function $f_{l}(\cdot)$, whose single thresholding parameter keeps the $k_{l}$ largest positive elements in the output of $D_{l}$. Note that $f_{l}(\cdot)$ keeps only a portion of the positive elements and discards the negative ones, because empirically both tend to carry similar information for classification. The choice of $k_{l}$ depends on both the sparsity of the output of $D_{l}$ and the compression ratio of the following $W_{l+1}$. Empirically, as discussed later, a large $k_{l}$ that retains most of the positive elements tends to give good classification performance when the dictionary output is not very sparse.

II-B Target backward propagation

The global target $\tilde{h}_{W}^{L}$ is propagated backward by repeating the following compressed sensing recovery process:

\tilde{h}_{D}^{l-1} = \arg\min_{h_{D}^{l-1}} \|\tilde{h}_{W}^{l} - W_{l} h_{D}^{l-1}\|_{2}^{2} \quad\text{s.t.}\quad \|h_{D}^{l-1}\|_{0} \leq k_{l-1} \qquad (4)

and

\tilde{h}_{W}^{l-1} = D_{l-1}^{\top}\tilde{h}_{D}^{l-1} \qquad (5)

where the layer index $l$ decreases from $L$ to $2$; $\tilde{h}_{D}^{l-1}$ and $\tilde{h}_{W}^{l-1}$ denote the local targets derived for $D_{l-1}$ and $W_{l-1}$, respectively. The derivations of the two targets are analyzed as follows. Let us first consider the derivation of $\tilde{h}_{D}^{l-1}$ by (4). By compressed sensing theory, the row size $m_{l}$ of $W_{l}$ in (4) should be at least twice the input sparsity $k_{l-1}$, namely $m_{l}>2k_{l-1}$, in order to approximately recover $\tilde{h}_{D}^{l-1}$ from $\tilde{h}_{W}^{l}$. Empirically, the condition need not be strictly met, and good classification is usually obtained even when $m_{l}<2k_{l-1}$. This implies that for layerwise learning, feeding forward a sufficient number of features may be more important than the accuracy of target propagation. Problem (4) can be solved with many algorithms [21]; in our experiments, we employ the well-known OMP algorithm [22], which has a parallel implementation in the software SPAMS [23]. As for the local target $\tilde{h}_{W}^{l-1}$, (5) shows that it can be simply derived from $\tilde{h}_{D}^{l-1}$ by matrix multiplication, thanks to the orthonormality of $D_{l-1}$.
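The backward recursion (4)-(5) can be sketched as follows. The small omp routine stands in for the parallel OMP of SPAMS [23], and the zero-based list conventions (Ws[i] stores $W_{i+1}$, Ds[i] stores $D_{i+1}$, ks[i] stores $k_{i+1}$) are implementation assumptions, not part of the method.

```python
import numpy as np

def omp(W, y, k):
    """Minimal orthogonal matching pursuit, standing in for the SPAMS implementation."""
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], y, rcond=None)
        residual = y - W[:, support] @ coef
    s = np.zeros(W.shape[1])
    s[support] = coef
    return s

def backpropagate_targets(Ws, Ds, tD_penult, ks):
    """Local targets for layers L-1, ..., 1 via (4) and (5), for one training instance.

    tD_penult is h̃_D^{L-1} obtained from (6). Returns lists tW, tD with
    tW[i] = h̃_W^{i+1} and tD[i] = h̃_D^{i+1} (zero-based indexing).
    """
    Lm1 = len(Ds)                              # number of hidden layers, L-1
    tW, tD = [None] * Lm1, [None] * Lm1
    tD[-1] = tD_penult                         # h̃_D^{L-1}, given by (6)
    for i in range(Lm1 - 1, 0, -1):            # layer index i+1 runs from L-1 down to 2
        tW[i] = Ds[i].T @ tD[i]                # (5): h̃_W^{i+1} = D_{i+1}^T h̃_D^{i+1}
        tD[i - 1] = omp(Ws[i], tW[i], ks[i - 1])   # (4): recover h̃_D^i from h̃_W^{i+1}
    tW[0] = Ds[0].T @ tD[0]                    # (5) applied once more at the first layer
    return tW, tD
```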

Usually, we prefer to simply label each class with a one-hot vector $\tilde{h}_{W}^{L}$, such that instances of the same class share the same local target at each layer. Note that this label sharing may cause difficulty for layerwise optimization, because instances of the same class have high variations, as often encountered at early layers [3]. To avoid this problem, an effective solution is to adopt diverse labels/targets to represent instances of the same class [3]. To achieve this, instead of (4), we propose to derive the local target $\tilde{h}_{D}^{L-1}$ of the penultimate layer by

\tilde{h}_{D}^{L-1} = h_{D}^{L-1} - \beta\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial h_{D}^{L-1}} \qquad (6)

and then derive the targets for the remaining layers by (4) and (5).
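For the softmax output (2) with a cross-entropy loss, the gradient in (6) has the closed form $W_{L}^{\top}(h_{W}^{L}-\tilde{h}_{W}^{L})$, which the per-instance sketch below uses; it is an illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def diversified_penultimate_target(WL, hD_penult, y_onehot, beta):
    """Target h̃_D^{L-1} from (6): a small gradient step on the global cross-entropy loss.

    Because the step depends on each instance's own activation h_D^{L-1}, instances of
    the same class receive different (diversified) targets.
    """
    p = softmax(WL @ hD_penult)          # current prediction h_W^L = g(W_L h_D^{L-1})
    grad = WL.T @ (p - y_onehot)         # dL/dh_D^{L-1} for softmax + cross entropy
    return hD_penult - beta * grad
```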

Algorithm 2 Online orthonormal dictionary learning [24]
1:  Input: Dataset $X\in\mathbb{R}^{m\times n}$ of $n$ samples of length $m$, and an initialized dictionary $D^{(0)}$.
2:  for $j=0$ to $J-1$ do
3:     $S^{(j)}=T({D^{(j)}}^{\top}X)$, where $T(\cdot)$ is a thresholding function that zeroes out all but the $d$ largest elements (in magnitude) in each column of its matrix input.
4:     Compute the SVD $X{S^{(j)}}^{\top}=P\Sigma Q^{\top}$.
5:     $D^{(j+1)}=PQ^{\top}$.
6:  end for
7:  Output: Learned dictionary $D={D^{(J)}}^{\top}$.
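A compact NumPy rendering of Algorithm 2 is given below. The random orthonormal initialization of $D^{(0)}$ is an assumption (the algorithm leaves it unspecified), and the routine returns the transposed dictionary so that it can be applied directly as $D_{l}$ in (1).

```python
import numpy as np

def hard_threshold_columns(S, d):
    """T(.): zero out all but the d largest-magnitude entries in each column of S."""
    out = np.zeros_like(S)
    idx = np.argsort(np.abs(S), axis=0)[-d:, :]        # row indices of the d largest per column
    cols = np.arange(S.shape[1])
    out[idx, cols] = S[idx, cols]
    return out

def learn_orthonormal_dictionary(X, d, J=20, seed=0):
    """Algorithm 2: procrustean orthonormal dictionary learning [20], [24].

    X stores one training activation h_W^l per column; the returned D satisfies
    D D^T = I and is oriented so that T(D X) is a d-sparse code of X.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    D, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthonormal start D^(0)
    for _ in range(J):
        S = hard_threshold_columns(D.T @ X, d)         # S^(j) = T(D^(j)T X)
        P, _, Qt = np.linalg.svd(X @ S.T)              # X S^(j)T = P Sigma Q^T
        D = P @ Qt                                     # D^(j+1) = P Q^T
    return D.T                                         # output D = D^(J)T
```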

Considering that the global loss $\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})$ has already been decreased in the initialization phase by updating $W_{L}$ in (3), we use a small $\beta$ in (6) to further reduce the global loss and achieve good classification performance. A small $\beta$ results in a small difference $h_{D}^{L-1}-\tilde{h}_{D}^{L-1}$, and then a sequence of small differences $h_{D}^{l}-\tilde{h}_{D}^{l}$ (with $l$ decreasing from $L-2$ to $1$) by the stable recovery of compressed sensing [25]. The small differences between forward activations and backward targets reduce the variations of $D_{l}$ and $W_{l}$ during optimization, and such low variations tend to improve the network's generalization [11, 12, 13]. To achieve this property, the TP method in [3] constrains the difference between targets and activations when learning the auxiliary network. However, the network-based method lacks stability and is thus inferior to our compressed sensing-based method.

II-C Forward layerwise learning

Feeding forward the network input $h_{D}^{0}$, we successively update the parameters $\{W_{l},D_{l}\}$ of each layer by matching their outputs $h_{W}^{l}$ and $h_{D}^{l}$ to the local targets $\tilde{h}_{W}^{l}$ and $\tilde{h}_{D}^{l}$. Formally, this process is realized by iterating

\tilde{W}_{l} = \arg\min_{W_{l}} \|\tilde{h}_{W}^{l} - W_{l} h_{D}^{l-1}\|_{2}^{2} + \lambda_{W_{l}}\|W_{l}\|_{2}^{2} \qquad (7)
h_{W}^{l} = \tilde{W}_{l} h_{D}^{l-1} \qquad (8)
h_{W}^{l}\,(\tilde{h}_{D}^{l})^{\top} = P\Sigma Q^{\top} \qquad (9)
\tilde{D}_{l} = QP^{\top} \qquad (10)
h_{D}^{l} = f_{l}(\tilde{D}_{l} h_{W}^{l}) \qquad (11)

where the index $l$ increases from $1$ to $L-1$. As a classifier, the final output layer is updated by

\tilde{W}_{L} = W_{L} - \eta\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial W_{L}}. \qquad (12)

The resulting $\{\tilde{W}_{l},\tilde{D}_{l}\}_{l=1}^{L}$ then constitute the final network. Note that the dictionary update rules (9) and (10) are adapted from Algorithm 2. For tractable computation, we simply update $W_{l}$ with the ridge regression model (7), although more sophisticated models, such as the Lasso [26] and the elastic net [27], would probably yield better updates.
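When a batch of activations and targets is stacked column-wise, the updates (7)-(11) admit the simple closed forms sketched below; the batch formulation is an illustrative assumption about how the per-instance equations are solved in practice, and the output layer (12) is updated with the same kind of gradient step as in (3).

```python
import numpy as np

def topk_positive_cols(V, k):
    """f_l applied column-wise: keep the k largest positive entries of each column."""
    out = np.zeros_like(V)
    idx = np.argsort(V, axis=0)[-k:, :]
    cols = np.arange(V.shape[1])
    vals = V[idx, cols]
    out[idx, cols] = np.where(vals > 0, vals, 0.0)
    return out

def update_layer(HD_prev, tW, tD, k, lam):
    """One pass of (7)-(11) for a hidden layer, on a batch stored column-wise.

    HD_prev holds forward inputs h_D^{l-1}; tW and tD hold the local targets
    h̃_W^l and h̃_D^l. Returns the updated (W_l, D_l) and refreshed activations.
    """
    n = HD_prev.shape[0]
    W = tW @ HD_prev.T @ np.linalg.inv(HD_prev @ HD_prev.T + lam * np.eye(n))  # (7), closed form
    HW = W @ HD_prev                              # (8): refreshed h_W^l
    P, _, Qt = np.linalg.svd(HW @ tD.T)           # (9): SVD of h_W^l (h̃_D^l)^T
    D = Qt.T @ P.T                                # (10): D_l = Q P^T
    HD = topk_positive_cols(D @ HW, k)            # (11): refreshed h_D^l
    return W, D, HW, HD
```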

For large-scale data, such as ImageNet [28], we suggest processing the data in batches and performing layerwise learning iteratively. For small- or medium-scale data, such as MNIST [29] and CIFAR10 [30], the learning can be accomplished in a single pass. In this case, compressed sensing-based target propagation needs to be performed only once, implying that the dictionaries are not required to be orthonormal in the final forward optimization; they can simply be updated by

\tilde{D}_{l} = \arg\min_{D_{l}} \|\tilde{h}_{D}^{l} - D_{l} h_{W}^{l}\|_{2}^{2} + \lambda_{D_{l}}\|D_{l}\|_{2}^{2}, \qquad (13)

instead of the SVD method used in (9) and (10).
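Like (7), the single-pass update (13) has a closed-form solution over a column-wise batch, as in the brief sketch below (the batch convention is an illustrative assumption); the result need not be orthonormal, which is acceptable when the targets are propagated only once.

```python
import numpy as np

def ridge_dictionary_update(HW, tD, lam):
    """Closed-form solution of (13): D_l = argmin ||h̃_D^l - D_l h_W^l||^2 + λ||D_l||^2."""
    m = HW.shape[0]                 # columns of HW are h_W^l, columns of tD are h̃_D^l
    return tD @ HW.T @ np.linalg.inv(HW @ HW.T + lam * np.eye(m))
```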

III Experiments

In this section, we aim to show that the proposed compressed sensing-based target propagation (CSTP) method can achieve better performance than the auxiliary network-based target propagation (ANTP) method [3] for the layerwise learning of deep networks. Moreover, we analyze the proposed CSTP method with an ablation study. For comparison, as in [3], we test the proposed CSTP method on two typical datasets, MNIST [29] and CIFAR10 [30]. The MNIST dataset consists of $28\times 28$ gray-scale images of handwritten digits, with a training set of 60,000 samples and a test set of 10,000 samples. The CIFAR10 dataset consists of $32\times 32$ natural color images in 10 classes, with 50,000 samples in the training set and 10,000 samples in the test set.

III-A Settings

In the following, we briefly introduce the parameter settings of the proposed CSTP-based layerwise learning algorithm. As shown in Algorithm 1, the algorithm consists of three phases. In the initial feedforward phase, the dictionaries are learned with Algorithm 2, for which we set the number of iterations to 20 and the sparsity parameter to $d\approx m_{l}/3$. The thresholding function $f_{l}(\cdot)$ keeps about $k_{l}\approx m_{l}/2$ positive elements. By (3), we initialize the output layer $W_{L}$, namely the classifier. To obtain better performance, we handle (3) with stochastic gradient descent (SGD), using a momentum of 0.9, a weight decay of 0.0005, a batch size of 64, and 30 epochs. The learning rate is initialized to 0.01 and decayed by a factor of 0.9 every 4 epochs. In the target backpropagation phase, we first diversify the targets by (6) with $\beta=0.2$. The targets are then backpropagated by (4), which is implemented with OMP [22] with $k_{l}\approx n_{l}/3$. In the final layerwise learning phase, we update $W_{l}$ and $D_{l}$ by directly solving (7) and (13). The output layer $W_{L}$ is updated by (12) via SGD, with the same parameters as for (3). The above settings are adopted for both MNIST and CIFAR10.
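For convenience, the hyperparameters listed above can be recorded compactly as below; the snippet only restates the given values and reproduces the stated learning-rate schedule.

```python
# Settings copied from the text (illustrative bookkeeping only).
sgd_config = {"momentum": 0.9, "weight_decay": 5e-4, "batch_size": 64, "epochs": 30}

def learning_rate(epoch, lr0=0.01, gamma=0.9, step=4):
    """Initial rate 0.01, decayed by a factor of 0.9 every 4 epochs."""
    return lr0 * gamma ** (epoch // step)

# e.g. learning_rate(0) == 0.01, learning_rate(4) == 0.009
```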

III-B Results

III-B1 Performance comparison on CIFAR10

With CSTP-based layerwise learning, we obtain a test error of 47.76% on a three-layer network of size 3072-2500-1500-10. In contrast, ANTP [3] reports an error of 49.29% on a four-layer network of size 3072-1000-1000-1000-10. This means that CSTP can achieve higher accuracy than ANTP, given appropriate network sizes. Recall that the network size of CSTP is constrained by compressed sensing. Similarly to ANTP [3], CSTP-based layerwise learning performs worse than the prevailing BP method, with an accuracy gap of about 4% on the network we test. As discussed in [31], the inferior performance is due to the mismatch between the poorly separable features and the strong supervision constraint imposed on each layer.

III-B2 Performance comparison on MNIST

In [3], ANTP achieves a test error of 1.94% on a network with 7 hidden layers of 240 units each. The error is reduced by about 0.15% when we test CSTP on a wide but shallow network of size 784-3500-2500-10. From Table I, it is seen that wider layers tend to yield better performance for CSTP. The reason is as follows. The performance of CSTP depends on the dictionary learning-based feature selection and the compressed sensing-based target propagation. For dictionary learning, higher dimensions (namely, wider layers) usually lead to sparser transforms, which benefit not only feature selection but also compressed sensing.

III-B3 Ablation study of DlD_{l} and fl()f_{l}(\cdot)

To see the importance of the dictionary $D_{l}$ and the nonlinear function $f_{l}(\cdot)$ in CSTP, we remove them from Algorithm 1; even after fine-tuning, we observe an accuracy reduction of about 1.2% on CIFAR10 and 0.7% on MNIST. If only $f_{l}(\cdot)$ is removed, the accuracy decreases by about 0.3% on CIFAR10 and 0.1% on MNIST.

Table I: The layerwise learning performance of CSTP on different network architectures.
Dataset    Network size         Error (%)
MNIST      784-4000-3000-10     1.72
MNIST      784-3500-2500-10     1.79
MNIST      784-2000-1000-10     2.86
CIFAR10    3072-4000-3000-10    47.48
CIFAR10    3072-2500-1500-10    47.76
CIFAR10    3072-2000-1000-10    48.87

IV Conclusion

In this letter, we have shown that a cascade of compressed sensing operations can be established as a reversible network, which allows for layerwise learning and analysis. The optimization target of each layer is derived by inversely propagating the target of the network via compressed sensing. The method avoids learning an auxiliary network dedicated to target propagation, which reduces the complexity of network learning. Moreover, the method achieves favorable performance owing to the stability of compressed sensing recovery. The proposed target propagation involves two major computations, dictionary learning and sparse recovery, both of which have polynomial complexity. To reduce the complexity, it would be interesting to sparsify the structures of the dictionaries and projection matrices, such that combinatorial algorithms with linear complexity could be adopted [32, 33]. This is left for future work. Note that in this letter we focus on the layerwise learning method that optimizes exactly one layer at a time and thus reduces network analysis to the layer level. Currently, such methods seem difficult to bring to BP-level performance, especially on large-scale data [31], because the finite number of parameters in one layer can hardly handle the matching between poorly separable features and strong supervision constraints [31]. The problem could be addressed if more than one layer is learned at a time [5, 6].

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
  • [2] Y. Bengio, “How auto-encoders could provide credit assignment in deep networks via target propagation,” arXiv preprint arXiv:1407.7906, 2014.
  • [3] D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio, “Difference target propagation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2015, pp. 498–515.
  • [4] E. S. Marquez, J. S. Hare, and M. Niranjan, “Deep cascade learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5475–5485, 2018.
  • [5] E. Belilovsky, M. Eickenberg, and E. Oyallon, “Greedy layerwise learning can scale to ImageNet,” in International Conference on Machine Learning.   PMLR, 2019, pp. 583–593.
  • [6] A. Nøkland and L. H. Eidnes, “Training neural networks with local error signals,” in International Conference on Machine Learning.   PMLR, 2019, pp. 4839–4850.
  • [7] A. R. Barron, “Approximation and estimation bounds for artificial neural networks,” Machine Learning, vol. 14, no. 1, pp. 115–133, 1994.
  • [8] A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numerica, vol. 8, pp. 143–195, 1999.
  • [9] F. Bach, “Breaking the curse of dimensionality with convex neural networks,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 629–681, 2017.
  • [10] S. Bartunov, A. Santoro, B. A. Richards, L. Marris, G. E. Hinton, and T. Lillicrap, “Assessing the scalability of biologically-motivated deep learning algorithms and architectures,” arXiv preprint arXiv:1807.04587, 2018.
  • [11] G. K. Dziugaite and D. M. Roy, “Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data,” arXiv preprint arXiv:1703.11008, 2017.
  • [12] V. Nagarajan and J. Z. Kolter, “Generalization in deep networks: The role of distance from initialization,” arXiv preprint arXiv:1901.01672, 2019.
  • [13] L. Q. Trinh, “Greedy layerwise training of convolutional neural networks,” Ph.D. dissertation, Massachusetts Institute of Technology, 2019.
  • [14] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, Dec. 2003.
  • [15] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via $l^{1}$ minimization,” Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.
  • [16] S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
  • [17] J. Zarka, L. Thiry, T. Angles, and S. Mallat, “Deep network classification by scattering and homotopy dictionary learning,” in International Conference on Learning Representations, 2020.
  • [18] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203 – 4215, Dec. 2005.
  • [19] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemp. Math., vol. 26, pp. 189–206, 1984.
  • [20] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [21] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, “A survey of sparse representation: algorithms and applications,” IEEE Access, vol. 3, pp. 490–530, 2015.
  • [22] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 1993, pp. 40–44.
  • [23] J. Mairal, F. Bach, and J. Ponce, Sparse Modeling for Image and Vision Processing, ser. Foundations and Trends in Computer Graphics and Vision.   Now publishers, Dec. 2014, vol. 8, no. 2-3.
  • [24] C. Bao, J.-F. Cai, and H. Ji, “Fast sparsity-based orthogonal dictionary learning for image restoration,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3384–3391.
  • [25] E. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, pp. 1207–1223, 2006.
  • [26] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
  • [27] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [29] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [30] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [31] W. Ma, M. Yu, K. Li, and G. Wang, “Why layer-wise learning is hard to scale-up and a possible solution via accelerated downsampling,” in IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI).   IEEE, 2020, pp. 238–243.
  • [32] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proceedings of the IEEE, vol. 98, no. 6, pp. 937–947, June 2010.
  • [33] W. Lu, W. Li, W. Zhang, and S.-T. Xia, “Expander recovery performance of bipartite graphs with girth greater than 4,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 3, pp. 418–427, 2018.