
Cascaded Compressed Sensing Networks: A Reversible Architecture for Layerwise Learning

Weizhi Lu, Mingrui Chen, Kai Guo, and Weiyu Li
W. Lu, M. Chen, and K. Guo are with the School of Control Science and Engineering, Shandong University, Jinan, China. E-mail: [email protected]. W. Li is with the Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China. E-mail: [email protected]
Abstract

Recently, methods that learn networks layer by layer have attracted increasing interest for their ease of analysis. For such methods, the main challenge lies in deriving an optimization target for each layer by inversely propagating the global target of the network. This propagation problem is ill posed, because it involves inverting nonlinear activations from low-dimensional to high-dimensional spaces. The existing solution is to learn an auxiliary network dedicated to propagating the target. However, such a network lacks stability and, moreover, increases the complexity of network learning. In this letter, we show that target propagation can be achieved by modeling each layer of the network with compressed sensing, without the need for auxiliary networks. Experiments show that the proposed method achieves better performance than the auxiliary network-based method.

Index Terms:
Deep neural networks, reversible networks, layerwise learning, compressed sensing

I Introduction

The error backpropagation (BP) method has achieved great success in the supervised learning of deep neural networks [1]. Because it optimizes the entire network as a whole, however, the method can hardly disclose the contribution of each layer, which makes the network difficult to analyze. As an alternative to BP, the emerging layerwise learning method [2, 3] seems more suitable for network analysis. The method learns the network layer by layer and thus reduces network analysis to the layer level.

For layerwise learning, the key is to derive an optimization target for each layer, called the local target hereafter. Currently, the simplest method is to directly employ the global target [4], namely the optimization target of the entire network, as the target of each layer. However, the global target is not guaranteed to be optimal for each layer in terms of minimizing the global loss. To achieve BP-level performance, the methods in [5, 6] append a classifier to the layer of interest and then optimize them together with the global target. This forms a shallow network with at least one hidden layer, in which the layer of interest is still hidden and thus hard to analyze [7, 8, 9]. To directly optimize and analyze each layer, we resort to another emerging layerwise learning method known as target propagation (TP) [2, 3, 10], which derives the local target of each layer by inversely propagating the global target and optimizes the layer by matching its output to the target. In this method, the local targets are required to satisfy two properties. First, their feedforward outputs should be identical or close to the global target, which ensures that the optimization of each layer reduces the global loss to the maximum extent. Second, the local target of each layer should be close to the forward output of that layer. This helps to reduce the variation of each layer over optimization, and low variations tend to improve the network's generalization [11, 12, 13]. However, the two properties are hard to achieve because the inverse propagation of the global target involves inverting nonlinear activations from low-dimensional to high-dimensional spaces, which is an ill-posed problem. The existing solution is to learn an auxiliary network dedicated to propagating the target. Obviously, this increases the complexity of network learning, and moreover, the auxiliary network provides no guarantee of stable and accurate target propagation. In this letter, we show that auxiliary networks are not necessary for layerwise learning, and that target propagation can be achieved with guaranteed accuracy by modeling each layer of the network with compressed sensing.

In compressed sensing [14, 15], it has been proved that a high-dimensional signal $x\in\mathbb{R}^{n}$ with a sparse transform $s=Dx$ over an orthonormal dictionary $D\in\mathbb{R}^{n\times n}$ can be recovered from its low-dimensional projection $y=WDx=Ws$, where $W\in\mathbb{R}^{m\times n}$ and $m\ll n$. The recovery is implemented by first recovering the sparse vector $s$ from $y$ via an $\ell_{0}$- or $\ell_{1}$-regularized least squares problem [16], and then simply computing $x=D^{\top}s$. Inspired by this reverse process, we propose to model each network layer as a compressed sensing process. Specifically, we formulate each layer as a cascade of $W$ and $D$, such that the input $s$ of $W$ can be recovered from its output $y$ via sparse recovery, and the input of $D$ (namely the output $y$ of $W$) can be simply recovered from its output $s$ by multiplying with $D^{\top}$. Repeating this process layer by layer, target propagation is finally accomplished. To obtain better performance, as usual, we further deploy a nonlinear function $f(\cdot)$ after each dictionary $D$ in order to remove activations of small magnitude. This operation brings two benefits. First, it further improves the sparsity of the output activations of $D$, which benefits the sparse recovery over the following $W$. Second, it tends to reduce within-class scattering and thus improves the classification accuracy [17]. Overall, with the combination of $W$, $D$ and $f(\cdot)$, we establish a compressed sensing framework for each layer, and obtain a cascaded compressed sensing architecture over all layers. To the best of our knowledge, this is the first time that compressed sensing has been introduced into layerwise learning to handle the target propagation problem.
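To make this reverse process concrete, the following NumPy sketch builds a toy instance of it: a signal with a sparse code is projected to low dimension, the code is recovered with a small orthogonal matching pursuit (OMP) routine, and the signal is obtained by applying the adjoint of the dictionary. The routine and the problem sizes are illustrative only, not the implementation used in the experiments.

```python
import numpy as np

def omp(W, y, k):
    """Minimal orthogonal matching pursuit: find a k-sparse s with y ≈ W s."""
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))      # most correlated column
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], y, rcond=None)
        residual = y - W[:, support] @ coef             # update the residual
    s = np.zeros(W.shape[1])
    s[support] = coef
    return s

# Toy demonstration of the reverse process (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n, m, k = 256, 64, 8
D, _ = np.linalg.qr(rng.standard_normal((n, n)))        # orthonormal dictionary D
W = rng.standard_normal((m, n)) / np.sqrt(m)            # Gaussian projection W, m << n
s = np.zeros(n)
s[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse code s
x = D.T @ s                                             # signal whose transform is s = D x
y = W @ D @ x                                           # low-dimensional projection y = W D x = W s
x_hat = D.T @ omp(W, y, k)                              # recover s from y, then x = D^T s
print("recovery error:", np.linalg.norm(x - x_hat))
```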

Compared to the auxiliary network-based TP method, the proposed compressed sensing-based TP method has two major advantages. First, it avoids the use of auxiliary networks and thus reduces the complexity of network learning. Second, it can achieve target propagation with guaranteed accuracy. In layerwise learning, the number of local targets that must be computed and stored equals the number of activations in the layer. Since convolutional networks contain huge numbers of activations and thus pose great computational challenges to layerwise learning, we only test the proposed layerwise learning method on fully connected networks, as in [3, 10], for tractable computation. Our performance advantage is verified by classification experiments on MNIST and CIFAR10.

II Method

Figure 1: The proposed depth-$L$ network structure, with feedforward activations marked on the upper part and feedback targets marked on the lower part.

Let us consider a deep fully connected network with $L$ layers as shown in Fig. 1. Each layer consists of two sublayers $\{W_{l},D_{l}\}$, where $W_{l}\in\mathbb{R}^{m_{l}\times n_{l}}$ with $m_{l}\ll n_{l}$ denotes a projection matrix and $D_{l}\in\mathbb{R}^{m_{l}\times m_{l}}$ denotes an orthonormal dictionary, $D_{l}^{\top}D_{l}=I$. Following $D_{l}$ is a pointwise thresholding function $f_{l}(\cdot)$ which zeroes out all but the $k_{l}$ ($\ll m_{l}$) largest positive elements, acting similarly to the ReLU function. Note that the output layer is a classifier which comprises $W_{L}$ but no $D_{L}$. Suppose $h_{W}^{l}\in\mathbb{R}^{m_{l}}$ and $h_{D}^{l}\in\mathbb{R}^{m_{l}}$ are the output activations of $W_{l}$ and $D_{l}$, respectively. The feedforward network structure is formulated by

h_{W}^{l} = W_{l} h_{D}^{l-1} \quad\text{and}\quad h_{D}^{l} = f_{l}(D_{l} h_{W}^{l}), \quad l = 1,\ldots,L-1 \qquad (1)
h_{W}^{l} = g(W_{l} h_{D}^{l-1}), \quad l = L \qquad (2)

where $g(\cdot)$ is the softmax function, and the notations $h_{D}^{0}$ and $h_{W}^{L}$ denote the network's input and output, respectively.
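For concreteness, a minimal NumPy sketch of the feedforward pass (1)-(2) is given below; the top-$k$ positive thresholding implements $f_{l}(\cdot)$, and the list conventions (Ws holds $W_{1},\ldots,W_{L}$, Ds holds $D_{1},\ldots,D_{L-1}$, ks holds $k_{1},\ldots,k_{L-1}$) are illustrative assumptions.

```python
import numpy as np

def topk_positive(v, k):
    """f_l: keep the k largest positive entries of v and zero out everything else."""
    out = np.zeros_like(v)
    idx = np.argsort(v)[-k:]          # indices of the k largest entries
    keep = idx[v[idx] > 0]            # among them, keep only the positive ones
    out[keep] = v[keep]
    return out

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(h0, Ws, Ds, ks):
    """Feedforward pass of (1)-(2) for a single input vector h0 = h_D^0."""
    hW, hD = [], [h0]
    for W, D, k in zip(Ws[:-1], Ds, ks):
        hW.append(W @ hD[-1])                         # (1): h_W^l = W_l h_D^{l-1}
        hD.append(topk_positive(D @ hW[-1], k))       # (1): h_D^l = f_l(D_l h_W^l)
    hW.append(softmax(Ws[-1] @ hD[-1]))               # (2): h_W^L = g(W_L h_D^{L-1})
    return hW, hD
```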

As sketched in Algorithm 1, the proposed layerwise learning is implemented in three phases. First, we initialize the network by feeding forward the network input $h_{D}^{0}$. During this process, the projection matrices $W_{l}$ are randomly generated, and the orthonormal dictionaries $D_{l}$ are learned from their input activations $h_{W}^{l}$. Second, as shown in the lower part of Fig. 1, the global target $\tilde{h}_{W}^{L}$ is inversely propagated layer by layer via compressed sensing, producing local targets $\tilde{h}_{W}^{l}$ and $\tilde{h}_{D}^{l}$ for $W_{l}$ and $D_{l}$, respectively, $1\leq l<L$. Finally, we feed the input $h_{D}^{0}$ forward again, in order to successively minimize the local losses $\mathcal{L}_{l}(h_{W}^{l},\tilde{h}_{W}^{l})$ and $\mathcal{L}_{l}(h_{D}^{l},\tilde{h}_{D}^{l})$ of each layer by updating $W_{l}$ and $D_{l}$, $1\leq l<L$, which finally minimizes the global loss $\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})$. In this letter, we simply define the local losses with Euclidean distances and the global loss with cross entropy. The three phases are detailed in the following.

Algorithm 1 The proposed layerwise learning
1:  Phase 1: Activation forward propagation.
2:  for $l=1$ to $L$ do
3:     Generate $h_{W}^{l}$ and $h_{D}^{l}$ by (1) and (2).
4:     Initialize $W_{l}$ with Gaussian distributions.
5:     Learn $D_{l}$ with its input $h_{W}^{l}$ by Algorithm 2.
6:  end for
7:  Further update the final $W_{L}$ by (3).
8:  Phase 2: Target backward propagation.
9:  Generate $\tilde{h}_{D}^{L-1}$ by (6) and $\tilde{h}_{W}^{L-1}$ by (5).
10:  for $l=L-1$ down to $2$ do
11:     Generate $\tilde{h}_{D}^{l-1}$ by (4) and $\tilde{h}_{W}^{l-1}$ by (5).
12:  end for
13:  Phase 3: Forward layerwise learning.
14:  for $l=1$ to $L-1$ do
15:     Update $W_{l}$ by (7) and $D_{l}$ by (9) and (10).
16:  end for
17:  Update the final $W_{L}$ by (12).

II-A Activation forward propagation

By (1) and (2), the network takes $h_{D}^{0}$ as input and $h_{W}^{L}$ as output. During the feedforward propagation, we initialize the projection matrix $W_{l}$ by drawing its elements from the standard normal distribution, which has been proved nearly optimal for compressed sensing [18]. To keep the $\ell_{2}$ norm of the input activation $h_{D}^{l-1}$ unchanged after projection, $W_{l}$ is scaled by a factor of $1/\sqrt{m_{l}}$ [19]. For the output layer $W_{L}$, besides the random initialization, we further update it by

W_{L} = W_{L} - \alpha\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial W_{L}}, \qquad (3)

in order to reduce the difference between the final output $h_{W}^{L}$ and the global target $\tilde{h}_{W}^{L}$. As detailed later, this reduced difference benefits the final layerwise optimization. Empirically, the step size $\alpha$ in (3) should be small to avoid overfitting.
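For the softmax output (2) with a cross-entropy loss, the gradient in (3) has a simple closed form; the sketch below writes one full-batch step with activations stored column-wise. The batch formulation is an illustrative assumption (the experiments use SGD with momentum instead).

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def update_output_layer(WL, HD, Y, alpha):
    """One gradient step of (3) on the cross-entropy loss.

    Columns of HD are penultimate activations h_D^{L-1}; columns of Y are one-hot
    global targets h̃_W^L. For softmax + cross entropy, dL/dW_L = (P - Y) HD^T.
    """
    P = softmax_cols(WL @ HD)                 # h_W^L = g(W_L h_D^{L-1}) for the batch
    grad = (P - Y) @ HD.T / HD.shape[1]       # gradient averaged over the batch
    return WL - alpha * grad
```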

Next, we move to learning the orthonormal dictionary $D_{l}$, which has a dense input $h_{W}^{l}$ and a sparse output $h_{D}^{l}$. Such dictionaries can be learned with a procrustean approach [20], which is given in Algorithm 2 for completeness. In the algorithm, there is a crucial parameter $d$, which counts the number of largest elements retained by the thresholding function $T(\cdot)$. To obtain sparse transforms, we should adopt a relatively small $d$. Note that Algorithm 2 supports online learning, a property that is necessary when the data are processed in batches. From the viewpoint of classification, the sparse transform matrix $D_{l}$ is mainly used for feature selection, while the projection matrix $W_{l}$ serves not only for feature selection but also for dimensionality reduction. Moreover, it is noteworthy that the order of $D_{l}$ and $W_{l}$ could be reversed in the cascaded compressed sensing framework. In this letter, we put $W_{l}$ first in order to decrease the dimension of the network input $h_{D}^{0}$ and thus reduce the optimization complexity.

Finally, let us consider the initialization of the nonlinear function $f_{l}(\cdot)$, whose single thresholding parameter keeps the $k_{l}$ largest positive elements in the output of $D_{l}$. Note that $f_{l}(\cdot)$ keeps only a portion of the positive elements and discards the negative ones, because empirically both tend to carry similar information for classification. The choice of $k_{l}$ depends on both the sparsity of the output of $D_{l}$ and the compression ratio of the following $W_{l+1}$. Empirically, as discussed later, a large $k_{l}$ that retains most of the positive elements tends to give good classification performance when the dictionary output is not very sparse.

II-B Target backward propagation

The global target $\tilde{h}_{W}^{L}$ is propagated backward by repeating the following compressed sensing recovery process:

\tilde{h}_{D}^{l-1} = \arg\min_{h_{D}^{l-1}} \|\tilde{h}_{W}^{l} - W_{l} h_{D}^{l-1}\|_{2}^{2} \quad\text{s.t.}\quad \|h_{D}^{l-1}\|_{0} \leq k_{l-1} \qquad (4)

and

\tilde{h}_{W}^{l-1} = D_{l-1}^{\top}\tilde{h}_{D}^{l-1} \qquad (5)

where the layer index $l$ decreases from $L$ to $2$; $\tilde{h}_{D}^{l-1}$ and $\tilde{h}_{W}^{l-1}$ denote the local targets derived for $D_{l-1}$ and $W_{l-1}$, respectively. The derivations of the two targets are analyzed as follows. Let us first consider the derivation of $\tilde{h}_{D}^{l-1}$ by (4). By compressed sensing theory, the row size $m_{l}$ of $W_{l}$ in (4) should be at least twice the input sparsity $k_{l-1}$, namely $m_{l}>2k_{l-1}$, in order to approximately recover $\tilde{h}_{D}^{l-1}$ from $\tilde{h}_{W}^{l}$. Empirically, the condition need not be strictly met, and good classification is usually obtained even when $m_{l}<2k_{l-1}$. This implies that for layerwise learning, feeding forward a sufficient number of features may be more important than the accuracy of target propagation. Problem (4) can be solved with many algorithms [21]; in our experiments, we employ the well-known OMP algorithm [22], which has a parallel implementation in the software SPAMS [23]. As for the local target $\tilde{h}_{W}^{l-1}$, (5) shows that it can be simply derived from $\tilde{h}_{D}^{l-1}$ by matrix multiplication, thanks to the orthonormality of $D_{l-1}$.
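The backward recursion (4)-(5) can be sketched as follows. The small omp routine stands in for the parallel OMP of SPAMS [23], and the zero-based list conventions (Ws[i] stores $W_{i+1}$, Ds[i] stores $D_{i+1}$, ks[i] stores $k_{i+1}$) are implementation assumptions, not part of the method.

```python
import numpy as np

def omp(W, y, k):
    """Minimal orthogonal matching pursuit, standing in for the SPAMS implementation."""
    support, residual = [], y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], y, rcond=None)
        residual = y - W[:, support] @ coef
    s = np.zeros(W.shape[1])
    s[support] = coef
    return s

def backpropagate_targets(Ws, Ds, tD_penult, ks):
    """Local targets for layers L-1, ..., 1 via (4) and (5), for one training instance.

    tD_penult is h̃_D^{L-1} obtained from (6). Returns lists tW, tD with
    tW[i] = h̃_W^{i+1} and tD[i] = h̃_D^{i+1} (zero-based indexing).
    """
    Lm1 = len(Ds)                              # number of hidden layers, L-1
    tW, tD = [None] * Lm1, [None] * Lm1
    tD[-1] = tD_penult                         # h̃_D^{L-1}, given by (6)
    for i in range(Lm1 - 1, 0, -1):            # layer index i+1 runs from L-1 down to 2
        tW[i] = Ds[i].T @ tD[i]                # (5): h̃_W^{i+1} = D_{i+1}^T h̃_D^{i+1}
        tD[i - 1] = omp(Ws[i], tW[i], ks[i - 1])   # (4): recover h̃_D^i from h̃_W^{i+1}
    tW[0] = Ds[0].T @ tD[0]                    # (5) applied once more at the first layer
    return tW, tD
```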

Usually, we prefer to simply label each class with a one-hot vector $\tilde{h}_{W}^{L}$, such that instances of the same class share the same local target at each layer. Note that this label sharing may cause difficulty for layerwise optimization, because instances of the same class have high variations, as often encountered at early layers [3]. To avoid this problem, an effective solution is to adopt diverse labels/targets to represent instances of the same class [3]. To achieve this, instead of (4), we propose to derive the local target $\tilde{h}_{D}^{L-1}$ of the penultimate layer by

\tilde{h}_{D}^{L-1} = h_{D}^{L-1} - \beta\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial h_{D}^{L-1}} \qquad (6)

and then derive the targets for the remaining layers by (4) and (5).
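For the softmax output (2) with a cross-entropy loss, the gradient in (6) has the closed form $W_{L}^{\top}(h_{W}^{L}-\tilde{h}_{W}^{L})$, which the per-instance sketch below uses; it is an illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def diversified_penultimate_target(WL, hD_penult, y_onehot, beta):
    """Target h̃_D^{L-1} from (6): a small gradient step on the global cross-entropy loss.

    Because the step depends on each instance's own activation h_D^{L-1}, instances of
    the same class receive different (diversified) targets.
    """
    p = softmax(WL @ hD_penult)          # current prediction h_W^L = g(W_L h_D^{L-1})
    grad = WL.T @ (p - y_onehot)         # dL/dh_D^{L-1} for softmax + cross entropy
    return hD_penult - beta * grad
```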

Algorithm 2 Online orthonormal dictionary learning [24]
1:  Input: Dataset $X\in\mathbb{R}^{m\times n}$ of $n$ samples of length $m$, and an initialized dictionary $D^{(0)}$.
2:  for $j=0$ to $J-1$ do
3:     $S^{(j)}=T({D^{(j)}}^{\top}X)$, where $T(\cdot)$ is a thresholding function that zeroes out all but the $d$ largest elements (in magnitude) in each column of its matrix input.
4:     Compute the SVD $X{S^{(j)}}^{\top}=P\Sigma Q^{\top}$.
5:     $D^{(j+1)}=PQ^{\top}$.
6:  end for
7:  Output: Learned dictionary $D={D^{(J)}}^{\top}$.
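A compact NumPy rendering of Algorithm 2 is given below. The random orthonormal initialization of $D^{(0)}$ is an assumption (the algorithm leaves it unspecified), and the routine returns the transposed dictionary so that it can be applied directly as $D_{l}$ in (1).

```python
import numpy as np

def hard_threshold_columns(S, d):
    """T(.): zero out all but the d largest-magnitude entries in each column of S."""
    out = np.zeros_like(S)
    idx = np.argsort(np.abs(S), axis=0)[-d:, :]        # row indices of the d largest per column
    cols = np.arange(S.shape[1])
    out[idx, cols] = S[idx, cols]
    return out

def learn_orthonormal_dictionary(X, d, J=20, seed=0):
    """Algorithm 2: procrustean orthonormal dictionary learning [20], [24].

    X stores one training activation h_W^l per column; the returned D satisfies
    D D^T = I and is oriented so that T(D X) is a d-sparse code of X.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    D, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthonormal start D^(0)
    for _ in range(J):
        S = hard_threshold_columns(D.T @ X, d)         # S^(j) = T(D^(j)T X)
        P, _, Qt = np.linalg.svd(X @ S.T)              # X S^(j)T = P Sigma Q^T
        D = P @ Qt                                     # D^(j+1) = P Q^T
    return D.T                                         # output D = D^(J)T
```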

Considering that the global loss $\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})$ has already been decreased in the initialization phase by updating $W_{L}$ in (3), we use a small $\beta$ in (6) to further reduce the global loss and achieve good classification performance. A small $\beta$ results in a small difference $h_{D}^{L-1}-\tilde{h}_{D}^{L-1}$, and then a sequence of small differences $h_{D}^{l}-\tilde{h}_{D}^{l}$ (with $l$ decreasing from $L-2$ to $1$) by the stable recovery of compressed sensing [25]. The small differences between forward activations and backward targets reduce the variations of $D_{l}$ and $W_{l}$ during optimization, and such low variations tend to improve the network's generalization [11, 12, 13]. To achieve this property, the TP method in [3] constrains the difference between targets and activations when learning the auxiliary network. However, the network-based method lacks stability and is thus inferior to our compressed sensing-based method.

II-C Forward layerwise learning

Feeding forward the network input $h_{D}^{0}$, we successively update the parameters $\{W_{l},D_{l}\}$ of each layer by matching their outputs $h_{W}^{l}$ and $h_{D}^{l}$ to the local targets $\tilde{h}_{W}^{l}$ and $\tilde{h}_{D}^{l}$. Formally, this process is realized by iterating

\tilde{W}_{l} = \arg\min_{W_{l}} \|\tilde{h}_{W}^{l} - W_{l} h_{D}^{l-1}\|_{2}^{2} + \lambda_{W_{l}}\|W_{l}\|_{2}^{2} \qquad (7)
h_{W}^{l} = \tilde{W}_{l} h_{D}^{l-1} \qquad (8)
h_{W}^{l}\,(\tilde{h}_{D}^{l})^{\top} = P\Sigma Q^{\top} \qquad (9)
\tilde{D}_{l} = QP^{\top} \qquad (10)
h_{D}^{l} = f_{l}(\tilde{D}_{l} h_{W}^{l}) \qquad (11)

where the index $l$ increases from $1$ to $L-1$. As a classifier, the final output layer is updated by

\tilde{W}_{L} = W_{L} - \eta\frac{\partial\mathcal{L}(h_{W}^{L},\tilde{h}_{W}^{L})}{\partial W_{L}}. \qquad (12)

The resulting $\{\tilde{W}_{l},\tilde{D}_{l}\}_{l=1}^{L}$ then constitute the final network. Note that the dictionary update rules (9) and (10) are adapted from Algorithm 2. For tractable computation, we simply update $W_{l}$ with the ridge regression model (7), although more sophisticated models, such as the Lasso [26] and the elastic net [27], would probably yield better updates.
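When a batch of activations and targets is stacked column-wise, the updates (7)-(11) admit the simple closed forms sketched below; the batch formulation is an illustrative assumption about how the per-instance equations are solved in practice, and the output layer (12) is updated with the same kind of gradient step as in (3).

```python
import numpy as np

def topk_positive_cols(V, k):
    """f_l applied column-wise: keep the k largest positive entries of each column."""
    out = np.zeros_like(V)
    idx = np.argsort(V, axis=0)[-k:, :]
    cols = np.arange(V.shape[1])
    vals = V[idx, cols]
    out[idx, cols] = np.where(vals > 0, vals, 0.0)
    return out

def update_layer(HD_prev, tW, tD, k, lam):
    """One pass of (7)-(11) for a hidden layer, on a batch stored column-wise.

    HD_prev holds forward inputs h_D^{l-1}; tW and tD hold the local targets
    h̃_W^l and h̃_D^l. Returns the updated (W_l, D_l) and refreshed activations.
    """
    n = HD_prev.shape[0]
    W = tW @ HD_prev.T @ np.linalg.inv(HD_prev @ HD_prev.T + lam * np.eye(n))  # (7), closed form
    HW = W @ HD_prev                              # (8): refreshed h_W^l
    P, _, Qt = np.linalg.svd(HW @ tD.T)           # (9): SVD of h_W^l (h̃_D^l)^T
    D = Qt.T @ P.T                                # (10): D_l = Q P^T
    HD = topk_positive_cols(D @ HW, k)            # (11): refreshed h_D^l
    return W, D, HW, HD
```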

For large-scale data, such as ImageNet [28], we suggest processing the data in batches and performing layerwise learning iteratively. For small- or medium-scale data, such as MNIST [29] and CIFAR10 [30], the learning can be accomplished in a single pass. In this case, compressed sensing-based target propagation needs to be performed only once, implying that the dictionaries are not required to be orthonormal in the final forward optimization; they can simply be updated by

\tilde{D}_{l} = \arg\min_{D_{l}} \|\tilde{h}_{D}^{l} - D_{l} h_{W}^{l}\|_{2}^{2} + \lambda_{D_{l}}\|D_{l}\|_{2}^{2}, \qquad (13)

instead of the SVD method used in (9) and (10).
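Like (7), the single-pass update (13) has a closed-form solution over a column-wise batch, as in the brief sketch below (the batch convention is an illustrative assumption); the result need not be orthonormal, which is acceptable when the targets are propagated only once.

```python
import numpy as np

def ridge_dictionary_update(HW, tD, lam):
    """Closed-form solution of (13): D_l = argmin ||h̃_D^l - D_l h_W^l||^2 + λ||D_l||^2."""
    m = HW.shape[0]                 # columns of HW are h_W^l, columns of tD are h̃_D^l
    return tD @ HW.T @ np.linalg.inv(HW @ HW.T + lam * np.eye(m))
```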

III Experiments

In this section, we aim to show that the proposed compressed sensing-based target propagation (CSTP) method can achieve better performance than the auxiliary network-based target propagation (ANTP) method [3] for the layerwise learning of deep networks. Moreover, we analyze the proposed CSTP method with an ablation study. For comparison, as in [3], we test the proposed CSTP method on two typical datasets, MNIST [29] and CIFAR10 [30]. The MNIST dataset consists of $28\times 28$ gray-scale images of handwritten digits, with a training set of 60,000 samples and a test set of 10,000 samples. The CIFAR10 dataset consists of $32\times 32$ natural color images in 10 classes, with 50,000 samples in the training set and 10,000 samples in the test set.

III-A Settings

In the following, we briefly introduce the parameter settings of the proposed CSTP-based layerwise learning algorithm. As shown in Algorithm 1, the algorithm consists of three phases. In the initial feedforward phase, the dictionaries are learned with Algorithm 2, for which we set the number of iterations to 20 and the sparsity parameter to $d\approx m_{l}/3$. The thresholding function $f_{l}(\cdot)$ keeps about $k_{l}\approx m_{l}/2$ positive elements. By (3), we initialize the output layer $W_{L}$, namely the classifier. To obtain better performance, we handle (3) with stochastic gradient descent (SGD), using a momentum of 0.9, a weight decay of 0.0005, a batch size of 64, and 30 epochs. The learning rate is initialized to 0.01 and decayed by a factor of 0.9 every 4 epochs. In the target backpropagation phase, we first diversify the targets by (6) with $\beta=0.2$. The targets are then backpropagated by (4), which is implemented with OMP [22] with $k_{l}\approx n_{l}/3$. In the final layerwise learning phase, we update $W_{l}$ and $D_{l}$ by directly solving (7) and (13). The output layer $W_{L}$ is updated by (12) via SGD, with the same parameters as for (3). The above settings are adopted for both MNIST and CIFAR10.
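For convenience, the hyperparameters listed above can be recorded compactly as below; the snippet only restates the given values and reproduces the stated learning-rate schedule.

```python
# Settings copied from the text (illustrative bookkeeping only).
sgd_config = {"momentum": 0.9, "weight_decay": 5e-4, "batch_size": 64, "epochs": 30}

def learning_rate(epoch, lr0=0.01, gamma=0.9, step=4):
    """Initial rate 0.01, decayed by a factor of 0.9 every 4 epochs."""
    return lr0 * gamma ** (epoch // step)

# e.g. learning_rate(0) == 0.01, learning_rate(4) == 0.009
```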

III-B Results

III-B1 Performance comparison on CIFAR10

With CSTP-based layerwise learning, we obtain a test error of 47.76% on a three-layer network of size 3072-2500-1500-10. In contrast, ANTP [3] reports an error of 49.29% on a four-layer network of size 3072-1000-1000-1000-10. This means that CSTP can achieve higher accuracy than ANTP, given appropriate network sizes. Recall that the network size of CSTP is constrained by compressed sensing. Similarly to ANTP [3], CSTP-based layerwise learning performs worse than the prevailing BP method, with an accuracy gap of about 4% on the network we test. As discussed in [31], the inferior performance is due to the mismatch between the poorly separable features and the strong supervision constraint imposed on each layer.

III-B2 Performance comparison on MNIST

In [3], ANTP achieves a test error of 1.94% on a network with 7 hidden layers of 240 units each. The error is reduced by about 0.15% when we test CSTP on a wide but shallow network of size 784-3500-2500-10. From Table I, it is seen that wider layers tend to yield better performance for CSTP. The reason is as follows. The performance of CSTP depends on the dictionary learning-based feature selection and the compressed sensing-based target propagation. For dictionary learning, higher dimensions (namely, wider layers) usually lead to sparser transforms, which benefit not only feature selection but also compressed sensing.

III-B3 Ablation study of DlD_{l} and fl()f_{l}(\cdot)

To see the importance of the dictionary $D_{l}$ and the nonlinear function $f_{l}(\cdot)$ in CSTP, we remove them from Algorithm 1; even after fine-tuning, we observe an accuracy reduction of about 1.2% on CIFAR10 and 0.7% on MNIST. If only $f_{l}(\cdot)$ is removed, the accuracy decreases by about 0.3% on CIFAR10 and 0.1% on MNIST.

Table I: The layerwise learning performance of CSTP on different network architectures.
Dataset    Network size         Error (%)
MNIST      784-4000-3000-10     1.72
MNIST      784-3500-2500-10     1.79
MNIST      784-2000-1000-10     2.86
CIFAR10    3072-4000-3000-10    47.48
CIFAR10    3072-2500-1500-10    47.76
CIFAR10    3072-2000-1000-10    48.87

IV Conclusion

In this letter, we have shown that a cascade of compressed sensing operations can be established as a reversible network, which allows for layerwise learning and analysis. The optimization target of each layer is derived by inversely propagating the target of the network via compressed sensing. The method avoids learning an auxiliary network dedicated to target propagation, which reduces the complexity of network learning. Moreover, the method achieves favorable performance owing to the stability of compressed sensing recovery. The proposed target propagation involves two major computations, dictionary learning and sparse recovery, both of which have polynomial complexity. To reduce the complexity, it would be interesting to sparsify the structures of the dictionaries and projection matrices, such that combinatorial algorithms with linear complexity could be adopted [32, 33]. This is left for future work. Note that in this letter we focus on the layerwise learning method that optimizes exactly one layer at a time and thus reduces network analysis to the layer level. Currently, such methods seem difficult to bring to BP-level performance, especially on large-scale data [31], because the finite number of parameters in one layer can hardly handle the matching between poorly separable features and strong supervision constraints [31]. The problem could be addressed if more than one layer is learned at a time [5, 6].

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
  • [2] Y. Bengio, “How auto-encoders could provide credit assignment in deep networks via target propagation,” arXiv preprint arXiv:1407.7906, 2014.
  • [3] D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio, “Difference target propagation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2015, pp. 498–515.
  • [4] E. S. Marquez, J. S. Hare, and M. Niranjan, “Deep cascade learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5475–5485, 2018.
  • [5] E. Belilovsky, M. Eickenberg, and E. Oyallon, “Greedy layerwise learning can scale to ImageNet,” in International Conference on Machine Learning.   PMLR, 2019, pp. 583–593.
  • [6] A. Nøkland and L. H. Eidnes, “Training neural networks with local error signals,” in International Conference on Machine Learning.   PMLR, 2019, pp. 4839–4850.
  • [7] A. R. Barron, “Approximation and estimation bounds for artificial neural networks,” Machine Learning, vol. 14, no. 1, pp. 115–133, 1994.
  • [8] A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numerica, vol. 8, pp. 143–195, 1999.
  • [9] F. Bach, “Breaking the curse of dimensionality with convex neural networks,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 629–681, 2017.
  • [10] S. Bartunov, A. Santoro, B. A. Richards, L. Marris, G. E. Hinton, and T. Lillicrap, “Assessing the scalability of biologically-motivated deep learning algorithms and architectures,” arXiv preprint arXiv:1807.04587, 2018.
  • [11] G. K. Dziugaite and D. M. Roy, “Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data,” arXiv preprint arXiv:1703.11008, 2017.
  • [12] V. Nagarajan and J. Z. Kolter, “Generalization in deep networks: The role of distance from initialization,” arXiv preprint arXiv:1901.01672, 2019.
  • [13] L. Q. Trinh, “Greedy layerwise training of convolutional neural networks,” Ph.D. dissertation, Massachusetts Institute of Technology, 2019.
  • [14] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, Dec. 2003.
  • [15] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via $l^{1}$ minimization,” Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.
  • [16] S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
  • [17] J. Zarka, L. Thiry, T. Angles, and S. Mallat, “Deep network classification by scattering and homotopy dictionary learning,” in International Conference on Learning Representations, 2020.
  • [18] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203 – 4215, Dec. 2005.
  • [19] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemp. Math., vol. 26, pp. 189–206, 1984.
  • [20] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [21] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, “A survey of sparse representation: algorithms and applications,” IEEE Access, vol. 3, pp. 490–530, 2015.
  • [22] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” in Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 1993, pp. 40–44.
  • [23] J. Mairal, F. Bach, and J. Ponce, Sparse Modeling for Image and Vision Processing, ser. Foundations and Trends in Computer Graphics and Vision.   Now publishers, Dec. 2014, vol. 8, no. 2-3.
  • [24] C. Bao, J.-F. Cai, and H. Ji, “Fast sparsity-based orthogonal dictionary learning for image restoration,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3384–3391.
  • [25] E. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, pp. 1207–1223, 2006.
  • [26] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
  • [27] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [29] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [30] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [31] W. Ma, M. Yu, K. Li, and G. Wang, “Why layer-wise learning is hard to scale-up and a possible solution via accelerated downsampling,” in IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI).   IEEE, 2020, pp. 238–243.
  • [32] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proceedings of the IEEE, vol. 98, no. 6, pp. 937–947, June 2010.
  • [33] W. Lu, W. Li, W. Zhang, and S.-T. Xia, “Expander recovery performance of bipartite graphs with girth greater than 4,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 3, pp. 418–427, 2018.