Nonlinear Tensor Ring Network
Abstract
The state-of-the-art deep neural networks (DNNs) have been widely applied to various real-world applications and have achieved significant performance on cognitive problems. However, the increasing width and depth of DNN architectures result in a huge number of parameters, which challenges storage and memory cost and limits the usage of DNNs on resource-constrained platforms, such as portable devices. By converting redundant models into compact ones, compression techniques appear to be a practical solution to reducing storage and memory consumption. In this paper, we develop a nonlinear tensor ring network (NTRN) in which both fully-connected and convolutional layers are compressed via tensor ring decomposition. Furthermore, to mitigate the accuracy loss caused by compression, a nonlinear activation function is embedded into the tensor contraction and convolution operations inside the compressed layer. Experimental results demonstrate the effectiveness and superiority of the proposed NTRN for image classification using two basic neural networks, LeNet-5 and VGG-11, on three datasets, viz. MNIST, Fashion MNIST and Cifar-10.
Index Terms:
Deep neural network, network compression, tensor decomposition, nonlinear tensor ring.
I Introduction
Recently, deep neural networks (DNNs) have attracted increasing attention due to their excellent performance in various fields, such as speech recognition [1, 2], image denoising [3], video categorization [4], object detection [5] and bioinformatics [6]. A pivotal characteristic of DNNs is their large depth, since increasing the number of layers improves the representational ability and thus attains higher accuracy [7]. However, the vast depth generates an enormous number of parameters that requires huge storage and memory space, restricting DNN deployment on resource-constrained devices, such as mobile phones, wearables and internet of things (IoT) devices. To that end, network compression techniques [8] have been proposed to compress DNNs.
In general, the methods for neural network compression can be roughly classified into four categories, namely, parameter pruning and quantization, low-rank factorization, transferred/compact convolution filters, and knowledge distillation [9]. Compared with the others, low-rank factorization is easy to implement and supports both from-scratch and pre-trained neural networks owing to its standardized pipeline. Therefore, in this work, we focus on low-rank factorization, which leverages the low-rank property of a matrix/tensor to decompose a weight array into multiple small-size factors and thus reduces the number of parameters. Low-rank matrix decomposition was proposed to compress multi-layer perceptrons (MLPs) [10] and convolutional neural networks (CNNs) [11]. However, the compressed models inevitably suffer from performance degradation. To alleviate the loss of accuracy at a high compression ratio (CR), [12] suggested combining low-rank and sparse decomposition to compress DNNs. This joint strategy factorizes a weight matrix into two small-size factor matrices and one sparse matrix based on the pre-trained neural network, and then fine-tunes the matrices to improve performance. In addition, neural network compression was considered as a multi-objective optimization problem in [13], where the CR and the classification error ratio were optimized simultaneously. Although it can achieve a satisfactory tradeoff between CR and classification error ratio, compression artifacts may exist in low-rank matrix decomposition based approaches because they have to reshape the tensorial weights in convolutional layers into a matrix.
To avoid impairing the tensor structure, low-rank tensor decomposition has been utilized to compress DNNs, achieving higher CR and accuracy than the matrix-based counterparts. Early work compressed DNNs using Tucker decomposition [14], followed by the CANDECOMP/PARAFAC (CP) decomposition [15] to reduce the performance loss at high CR. Besides, tensor train (TT) [16] and tensor ring (TR) [17, 18] models have been utilized for neural network compression. They convert the weight array into a high-order tensor for a high CR, and then factorize the corresponding tensor into multiple core tensors. Specifically, TR achieved an 18-times (18x) CR with only 0.1% accuracy loss on the LeNet-5 model for the MNIST dataset. Furthermore, Sun et al. [19] leveraged the structure sharing feature of ISTA-Net [20] and ResNet [7] to achieve a high CR with small accuracy loss. Recently, a nonlinear TT (NTT) method [21] was developed, which inserts a dynamic activation function between two adjacent core tensors to boost performance.
In this work, we aim to leverage low-rank tensor factorization to compress DNNs. It has been demonstrated that the TR format is able to achieve higher accuracy than TT decomposition at the same CR [22]. Therefore, we adopt TR factorization to devise a nonlinear TR network (NTRN), where both fully-connected and convolutional layers are compressed by TR factorization. Besides, a nonlinear activation function is added after the tensor contraction and convolution operations inside the compressed layer. Different from the NTT format, which unfolds the tensorial weights in convolutional layers into a matrix, the proposed NTRN operates directly on the tensorial weights, without distorting the spatial information in the convolution kernels. To verify the expressive ability of the proposed NTRN, we test it with two basic neural networks, LeNet-5 [23] and VGG-11 [24], on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. The experimental results demonstrate that our NTRN is able to alleviate the accuracy loss compared with the TR based method.
II Preliminaries
Consider a $d$th-order tensor $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$. TT decomposition represents $\mathcal{X}$ by two matrices and $(d-2)$ core tensors, that is,

$\mathcal{X}(i_1,i_2,\ldots,i_d)=\mathbf{G}_1(i_1,:)\,\mathcal{G}_2(:,i_2,:)\cdots\mathcal{G}_{d-1}(:,i_{d-1},:)\,\mathbf{G}_d(:,i_d)$   (1)

where $\mathcal{X}(i_1,i_2,\ldots,i_d)$ denotes the $(i_1,i_2,\ldots,i_d)$ entry, $\mathbf{G}_1\in\mathbb{R}^{I_1\times R_1}$, $\mathbf{G}_d\in\mathbb{R}^{R_{d-1}\times I_d}$, and $\mathcal{G}_k\in\mathbb{R}^{R_{k-1}\times I_k\times R_k}$ with $k=2,\ldots,d-1$ are termed core tensors, while $\{R_1,\ldots,R_{d-1}\}$ are the TT ranks. By contrast, TR decomposes $\mathcal{X}$ into $d$ 3rd-order tensors, shown as

$\mathcal{X}(i_1,i_2,\ldots,i_d)=\mathrm{Tr}\big(\mathcal{G}_1(:,i_1,:)\,\mathcal{G}_2(:,i_2,:)\cdots\mathcal{G}_d(:,i_d,:)\big)$   (2)

where $\mathcal{G}_k\in\mathbb{R}^{R_k\times I_k\times R_{k+1}}$ with $k=1,\ldots,d$, $R_{d+1}=R_1$, and $\{R_1,\ldots,R_d\}$ are the TR ranks. The main difference between the TT and TR factorizations lies in the first and last core arrays. Specifically, the TT format requires two matrices in the first and last positions, while all core arrays are 3rd-order tensors in the TR factorization. Thereby, the TT decomposition might result in large intermediate cores and small boundary factors, which restricts its representational ability and flexibility. Besides, multiplying TT cores must obey a strict order, and hence the convolution kernel in convolutional layers is unfolded into a vector for the sequenced operation [16]. This motivates us to devise a nonlinear TR format for neural network compression.
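To make the TR format in (2) concrete, the following short Python sketch reconstructs a full tensor from its TR cores with NumPy; the function name `tr_to_full`, the equal-rank setting and the random cores are illustrative choices, not part of the original work.

```python
import numpy as np

def tr_to_full(cores):
    """Rebuild a d-th order tensor from TR cores G_k of shape (R_k, I_k, R_{k+1}), following (2)."""
    full = cores[0]                                    # (R_1, I_1, R_2)
    for core in cores[1:]:
        # Chain the ring: contract the trailing rank mode with the next core's leading rank mode.
        full = np.einsum('...a,aib->...ib', full, core)
    # Close the ring: trace over the first and last rank modes.
    return np.trace(full, axis1=0, axis2=-1)

rng = np.random.default_rng(0)
cores = [rng.standard_normal((3, n, 3)) for n in (4, 5, 6)]  # I = (4, 5, 6), all TR ranks equal to 3
print(tr_to_full(cores).shape)                               # (4, 5, 6)
```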
III Nonlinear Tensor Ring Network
In this section, the proposed NTRN is introduced in detail, for the fully-connected and convolutional layers, respectively.
III-A Fully-Connected Layer Decomposition
A fully-connected layer maps an input vector $\mathbf{x}\in\mathbb{R}^{I}$ into an output vector $\mathbf{y}\in\mathbb{R}^{O}$ via a weight matrix $\mathbf{W}\in\mathbb{R}^{I\times O}$. Mathematically, it is formulated as

$\mathbf{y}=\mathbf{W}^{T}\mathbf{x}$   (3)

where $(\cdot)^{T}$ denotes the transpose operator. Neural network compression by low-rank matrix/tensor factorization decomposes $\mathbf{W}$ into small-size factors. To compress $\mathbf{W}$ in tensor structure, we first reshape $\mathbf{W}$ into a high-order tensor $\mathcal{W}\in\mathbb{R}^{I_1\times\cdots\times I_m\times O_1\times\cdots\times O_n}$ with

$I=\prod_{k=1}^{m}I_k,\quad O=\prod_{k=1}^{n}O_k.$   (4)

Then, based on the TR format, $\mathcal{W}$ is factorized into $(m+n)$ 3rd-order tensors

$\mathcal{W}=\sum_{r_1=1}^{R_1}\cdots\sum_{r_{m+n}=1}^{R_{m+n}}\mathcal{G}_1(r_1,:,r_2)\circ\mathcal{G}_2(r_2,:,r_3)\circ\cdots\circ\mathcal{G}_{m+n}(r_{m+n},:,r_1)$   (5)
where $\circ$ implies the outer product operation, $\mathcal{G}_k\in\mathbb{R}^{R_k\times I_k\times R_{k+1}}$ with $k=1,\ldots,m$, $\mathcal{G}_{m+k}\in\mathbb{R}^{R_{m+k}\times O_k\times R_{m+k+1}}$ with $k=1,\ldots,n$ and $R_{m+n+1}=R_1$, and $\mathcal{G}_k(r_k,:,r_{k+1})$ is the vertical fiber of $\mathcal{G}_k$. Moreover, $\mathbf{x}$ and $\mathbf{y}$ need to be tensorized as $\mathcal{X}\in\mathbb{R}^{I_1\times\cdots\times I_m}$ and $\mathcal{Y}\in\mathbb{R}^{O_1\times\cdots\times O_n}$, respectively. Thereby, (3) can be rewritten in the tensor format:

$\mathcal{Y}=\big(\big(\mathcal{X}\bullet\mathcal{G}_1\big)\,\bar{\bullet}\,\mathcal{G}_2\big)\,\bar{\bullet}\cdots\bar{\bullet}\,\mathcal{G}_{m+n}$   (6)

where $\bullet$ and $\bar{\bullet}$ are the tensor contraction operations [27], in which the modes to be contracted are aligned by permutation when necessary. Consider $\mathcal{A}\in\mathbb{R}^{J_1\times\cdots\times J_{d_1}}$ and $\mathcal{B}\in\mathbb{R}^{K_1\times\cdots\times K_{d_2}}$ with $J_{d_1}=K_1$, $\mathcal{A}\bullet\mathcal{B}$ yields a $(d_1+d_2-2)$th-order tensor whose entries are calculated by

$(\mathcal{A}\bullet\mathcal{B})(j_1,\ldots,j_{d_1-1},k_2,\ldots,k_{d_2})=\sum_{t=1}^{J_{d_1}}\mathcal{A}(j_1,\ldots,j_{d_1-1},t)\,\mathcal{B}(t,k_2,\ldots,k_{d_2}).$   (7)

Similarly, with $J_{d_1-1}=K_1$ and $J_{d_1}=K_2$, $\mathcal{A}\,\bar{\bullet}\,\mathcal{B}$ generates a $(d_1+d_2-4)$th-order tensor with entries being

$(\mathcal{A}\,\bar{\bullet}\,\mathcal{B})(j_1,\ldots,j_{d_1-2},k_3,\ldots,k_{d_2})=\sum_{t_1=1}^{J_{d_1-1}}\sum_{t_2=1}^{J_{d_1}}\mathcal{A}(j_1,\ldots,j_{d_1-2},t_1,t_2)\,\mathcal{B}(t_1,t_2,k_3,\ldots,k_{d_2}).$   (8)
It is worth noting that (6) is the same as the fully-connected TR network in [17]. To boost the accuracy, we propose to include a nonlinear activation function after each tensor contraction in (6), resulting in

$\mathcal{Y}=\sigma\big(\cdots\sigma\big(\sigma\big(\mathcal{X}\bullet\mathcal{G}_1\big)\,\bar{\bullet}\,\mathcal{G}_2\big)\cdots\big)\,\bar{\bullet}\,\mathcal{G}_{m+n}$   (9)

where $\sigma(\cdot)$ signifies the nonlinear activation function, e.g., Tanh in Fig. 1. Note that $\sigma(\cdot)$ is not added after the last contraction with $\mathcal{G}_{m+n}$ since the intrinsic activation function outside the current layer replaces it. Fig. 1 illustrates the fully-connected layer in NTRN.
Fig. 1: The fully-connected layer in the proposed NTRN.
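To make the layer in Fig. 1 concrete, the following PyTorch sketch contracts the tensorized input with one TR core at a time and applies the activation after every contraction except the last, as in (9). The class name `NTRLinear`, the equal-rank assumption, the random initialization and the mode ordering are our own illustrative choices, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class NTRLinear(nn.Module):
    """Minimal sketch of a nonlinear tensor-ring fully-connected layer.

    The weight matrix is never formed explicitly: the tensorized input is
    contracted with one TR core at a time, and the activation is applied
    after every contraction except the last one (the layer's usual external
    activation follows anyway). All TR ranks are assumed equal here.
    """

    def __init__(self, in_factors, out_factors, rank, activation=torch.tanh):
        super().__init__()
        self.in_factors = list(in_factors)    # (I_1, ..., I_m), product = I
        self.out_factors = list(out_factors)  # (O_1, ..., O_n), product = O
        self.act = activation
        # One 3rd-order core of shape (R, dim, R) per tensorized mode (illustrative init).
        self.cores = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, n, rank) / math.sqrt(rank * n))
             for n in self.in_factors + self.out_factors]
        )

    def forward(self, x):                      # x: (B, I_1 * ... * I_m)
        b, m = x.shape[0], len(self.in_factors)
        rest = x.shape[1] // self.in_factors[0]
        t = x.reshape(b, self.in_factors[0], rest)
        # First input core: contract over I_1, keep both rank modes (R_1, R_2).
        t = self.act(torch.einsum('bij,ris->bjrs', t, self.cores[0]))
        # Remaining input cores: contract over I_k and the running rank mode.
        for k in range(1, m):
            rest //= self.in_factors[k]
            t = t.reshape(b, self.in_factors[k], rest, t.shape[-2], t.shape[-1])
            t = self.act(torch.einsum('bijar,ris->bjas', t, self.cores[k]))
        # t: (B, 1, R_1, R_{m+1}); the second axis accumulates output modes.
        for k in range(m, m + len(self.out_factors) - 1):
            t = self.act(torch.einsum('boar,rps->bopas', t, self.cores[k]))
            t = t.reshape(b, -1, t.shape[-2], t.shape[-1])
        # Last core closes the ring: contract both remaining rank modes, no trailing activation.
        y = torch.einsum('boar,rpa->bop', t, self.cores[-1])
        return y.reshape(b, -1)                # (B, O_1 * ... * O_n)
```

Setting `activation` to the identity would recover a plain TR fully-connected layer in the spirit of [17]; the nonlinearity between contractions is what distinguishes NTRN.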
III-B Convolutional Layer Decomposition
At the convolutional layer, an input tensor $\mathcal{X}\in\mathbb{R}^{H\times W\times C}$ is convoluted with a 4th-order kernel tensor $\mathcal{K}\in\mathbb{R}^{K\times K\times C\times O}$ to output a tensor $\mathcal{Y}\in\mathbb{R}^{H'\times W'\times O}$, formulated as

$\mathcal{Y}=\mathcal{X}\ast\mathcal{K}$   (10)

where $\ast$ denotes the convolution operation in CNNs. To compress $\mathcal{K}$, TR decomposition factorizes it into three core tensors, viz. $\mathcal{G}_1\in\mathbb{R}^{R_1\times K\times K\times R_2}$, $\mathcal{G}_2\in\mathbb{R}^{R_2\times C\times R_3}$ and $\mathcal{G}_3\in\mathbb{R}^{R_3\times O\times R_1}$. Therefore, $\mathcal{K}$ is computed by

$\mathcal{K}(k_1,k_2,c,o)=\sum_{r_1=1}^{R_1}\sum_{r_2=1}^{R_2}\sum_{r_3=1}^{R_3}\mathcal{G}_1(r_1,k_1,k_2,r_2)\,\mathcal{G}_2(r_2,c,r_3)\,\mathcal{G}_3(r_3,o,r_1).$   (11)
Note that $\mathcal{G}_1$ retains the $K\times K$ convolution kernel, which is different from the NTT format. In NTRN, the convolution operation (10) is decomposed into the following procedure:

$\mathcal{Z}_1=\sigma\big(\mathcal{X}\bullet\mathcal{G}_2\big)$   (12)

$\mathcal{Z}_2=\sigma\big(\mathcal{P}_1(\mathcal{Z}_1)\ast\mathcal{G}_1\big)$   (13)

$\mathcal{Y}=\mathcal{Z}_2\,\bar{\bullet}\,\mathcal{G}_3$   (14)

where $\mathcal{Z}_1\in\mathbb{R}^{H\times W\times R_2\times R_3}$, $\mathcal{Z}_2\in\mathbb{R}^{H'\times W'\times R_1\times R_3}$, and $\mathcal{P}_1$ is the tensor permutation that reorders the modes of $\mathcal{Z}_1$ so that the convolution in (13) is carried out over the spatial modes with $R_2$ acting as the channel mode. The relationship between $H'$ ($W'$) and $H$ ($W$) obeys

$H'=\left\lfloor\frac{H-K+2P}{S}\right\rfloor+1$   (15)

where $P$ and $S$ denote the padding and stride, respectively.
For the convolutional layer in NTRN, the input tensor is first contracted with the intermediate factor $\mathcal{G}_2$, then convoluted with the first core $\mathcal{G}_1$, and finally contracted with the last core $\mathcal{G}_3$. It is worth mentioning that this procedure is entirely different from the operation in the NTT format. On the other hand, when $C$ and $O$ are large, $\mathcal{G}_2$ and $\mathcal{G}_3$ can be further factorized into small-size core tensors to achieve a higher CR. For instance, $\mathcal{G}_2\in\mathbb{R}^{R_2\times C\times R_3}$ can be reshaped as $\widetilde{\mathcal{G}}_2\in\mathbb{R}^{R_2\times C_1\times C_2\times C_3\times R_3}$ with $C=C_1C_2C_3$, and then $\widetilde{\mathcal{G}}_2$ is factorized into three tensors, namely, $\mathcal{A}_1\in\mathbb{R}^{R_2\times C_1\times \widetilde{R}_1}$, $\mathcal{A}_2\in\mathbb{R}^{\widetilde{R}_1\times C_2\times \widetilde{R}_2}$ and $\mathcal{A}_3\in\mathbb{R}^{\widetilde{R}_2\times C_3\times R_3}$. To calculate $\mathcal{G}_2$ from its core tensors, the proposed NTRN adopts a nonlinear contraction operation

$\mathcal{G}_2=\sigma\big(\sigma\big(\mathcal{A}_1\bullet\mathcal{A}_2\big)\bullet\mathcal{A}_3\big).$   (16)
The convolutional layer in NTRN is illustrated in Fig. 2.
Fig. 2: The convolutional layer in the proposed NTRN.
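A minimal PyTorch sketch of the decomposed convolution in (12)-(14) is given below, assuming an NCHW input layout and equal TR ranks. The class name `NTRConv2d`, the choice to fold the $R_3$ mode into the batch dimension, and the initialization are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NTRConv2d(nn.Module):
    """Sketch of a nonlinear tensor-ring convolutional layer.

    The 4th-order kernel (K x K x C x O) is replaced by three TR cores:
    a spatial core G1 (R1, K, K, R2) that keeps the K x K kernel intact,
    an input-channel core G2 (R2, C, R3) and an output-channel core
    G3 (R3, O, R1). The activation follows the first contraction and the
    convolution, but not the final contraction.
    """

    def __init__(self, in_channels, out_channels, kernel_size, rank,
                 stride=1, padding=0, activation=torch.tanh):
        super().__init__()
        K, R = kernel_size, rank
        self.stride, self.padding, self.act, self.rank = stride, padding, activation, R
        self.g1 = nn.Parameter(torch.randn(R, K, K, R) * 0.1)            # spatial core
        self.g2 = nn.Parameter(torch.randn(R, in_channels, R) * 0.1)     # input-channel core
        self.g3 = nn.Parameter(torch.randn(R, out_channels, R) * 0.1)    # output-channel core

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        R = self.rank
        # (12): contract the input channels with G2 -> (B, R2, R3, H, W).
        z = self.act(torch.einsum('bchw,rcs->brshw', x, self.g2))
        # (13): convolve with G1, treating R2 as input channels, R1 as output
        # channels, and folding R3 into the batch dimension.
        z = z.permute(0, 2, 1, 3, 4).reshape(B * R, R, H, W)             # (B*R3, R2, H, W)
        w = self.g1.permute(0, 3, 1, 2)                                  # (R1, R2, K, K)
        z = self.act(F.conv2d(z, w, stride=self.stride, padding=self.padding))
        Hp, Wp = z.shape[-2], z.shape[-1]
        z = z.reshape(B, R, R, Hp, Wp)                                   # (B, R3, R1, H', W')
        # (14): contract both remaining rank modes with G3 -> (B, O, H', W').
        return torch.einsum('bsahw,soa->bohw', z, self.g3)
```

Folding $R_3$ into the batch dimension lets a single call to `F.conv2d` realize (13); other mode orderings are equally valid as long as the contracted modes match.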
III-C Compression Ratio
CR is defined as:

$\mathrm{CR}=\dfrac{\text{number of parameters in the original network}}{\text{number of parameters in the compressed network}}.$   (17)

For a fully-connected layer with weight matrix $\mathbf{W}\in\mathbb{R}^{I\times O}$, the number of parameters is $IO$. The proposed NTRN requires $R^2\big(\sum_{k=1}^{m}I_k+\sum_{k=1}^{n}O_k\big)$ parameters under the assumption of $R_1=\cdots=R_{m+n}=R$. On the other hand, the convolutional layer of a conventional neural network needs to store $K^2CO$ parameters for $\mathcal{K}\in\mathbb{R}^{K\times K\times C\times O}$. In NTRN, the channel dimensions of $\mathcal{K}$ are reshaped into $C=\prod_k C_k$ and $O=\prod_k O_k$ for a high CR. Hence, only $R^2\big(K^2+\sum_k C_k+\sum_k O_k\big)$ parameters are required. It can be seen that decreasing $R$ increases the CR.
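As a quick sanity check of these parameter counts, the snippet below evaluates the formulas above for a single fully-connected layer; the factorization and the rank are hypothetical examples, not the settings used in the experiments.

```python
# Parameter-count comparison implied by (17), assuming all TR ranks equal to R.
def fc_params(in_factors, out_factors, rank):
    """TR parameters of a fully-connected layer tensorized as I_1..I_m x O_1..O_n."""
    return sum(rank * n * rank for n in list(in_factors) + list(out_factors))

def conv_params(kernel_size, c_factors, o_factors, rank):
    """TR parameters of a conv layer: one spatial core plus factorized channel cores."""
    spatial = rank * kernel_size * kernel_size * rank
    channel = sum(rank * n * rank for n in list(c_factors) + list(o_factors))
    return spatial + channel

# Hypothetical example: a 784 x 1024 fully-connected layer with R = 10.
original = 784 * 1024
compressed = fc_params((4, 7, 4, 7), (4, 8, 8, 4), rank=10)
print(original / compressed)   # CR of this single layer
```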
IV Experimental Results
In this section, the proposed NTRN is evaluated via a basic MLP, a basic CNN, LeNet-5 [23] and VGG-11 [24] on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. All networks are implemented in the PyTorch framework [28]. The experiments for the MLP, CNN and LeNet-5 are run on an Nvidia RTX 2060 GPU, while VGG-11 is run on an Nvidia RTX 2080 Ti GPU. In our experiments, the TR ranks within an individual layer are the same, but can differ across layers. Unless otherwise specified, the activation function outside the compressed layers is ReLU and the pooling function in the CNNs is max-pooling (size = 2, stride = 2).
IV-A Fully-Connected Layer Evaluation
Table I: Compression results of the MLP on MNIST.

| Method | CR | Acc (%) | Std | H (%) | L (%) | Storage |
|---|---|---|---|---|---|---|
| Original | 1x | 98.15 | 0.001 | 98.3 | 98.0 | 5.1 MB |
| TRN [17] | 57x | 97.34 | 0.003 | 97.7 | 97.0 | 0.1 MB |
| NTRN | 57x | 97.72 | 0.002 | 97.9 | 97.4 | 0.1 MB |
| TRN [17] | 359x | 95.94 | 0.004 | 96.3 | 95.2 | 0.02 MB |
| NTRN | 359x | 96.51 | 0.002 | 96.7 | 96.3 | 0.02 MB |
We first test the proposed NTRN using an MLP on the MNIST dataset. The MLP consists of an input layer, two hidden layers and an output layer with 784, 1024, 512 and 10 nodes, respectively. All images in MNIST are reshaped as a vector of length 784. Therefore, the weight matrices of the MLP are $\mathbf{W}_1\in\mathbb{R}^{784\times1024}$, $\mathbf{W}_2\in\mathbb{R}^{1024\times512}$ and $\mathbf{W}_3\in\mathbb{R}^{512\times10}$. To compress them, $\mathbf{W}_1$, $\mathbf{W}_2$ and $\mathbf{W}_3$ are first tensorized into high-order tensors. The number of training epochs is set to 50. The experimental results based on 20 independent trials are tabulated in Table I, where Acc and Std denote the average accuracy on the test set and the standard deviation over all trials, respectively. In addition, the highest and lowest accuracy among all trials are listed in columns H and L, respectively. Moreover, Storage indicates the disk space occupied by the parameters. Different TR ranks are assigned to the three layers to attain the 57x and 359x CRs. We can see that the proposed NTRN outperforms TRN [17] at both low and high CRs in terms of the average accuracy, the standard deviation, and the highest and lowest accuracy.
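For illustration, this MLP could be compressed by replacing its three layers with the `NTRLinear` sketch given after Fig. 1, as below; the factorizations of 784, 1024 and 512 and the rank are hypothetical and do not correspond to the 57x or 359x configurations reported in Table I.

```python
import torch
import torch.nn as nn

# Hypothetical tensorizations: 784 = 4*7*4*7, 1024 = 4*8*8*4, 512 = 8*8*8, 10 kept as is.
mlp = nn.Sequential(
    NTRLinear((4, 7, 4, 7), (4, 8, 8, 4), rank=10), nn.ReLU(),
    NTRLinear((4, 8, 8, 4), (8, 8, 8), rank=10), nn.ReLU(),
    NTRLinear((8, 8, 8), (10,), rank=10),
)
print(mlp(torch.randn(64, 784)).shape)   # torch.Size([64, 10])
```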
IV-B Convolutional Layer Evaluation
We now investigate the compression performance via a CNN with two hidden convolutional layers and one fully-connected layer on MNIST. The first and second convolutional layers contain 16 and 32 kernels, respectively, with a kernel size of 3. Moreover, the stride is set to 1, and the padding is 0. Since the dimensions of the input data are $28\times28\times1$, the weight tensor at the first convolutional layer is $\mathcal{K}_1\in\mathbb{R}^{3\times3\times1\times16}$, and the weight tensor at the second convolutional layer is $\mathcal{K}_2\in\mathbb{R}^{3\times3\times16\times32}$, while the fully-connected layer maps the flattened feature map to the 10 output classes. To compress the weight arrays of the convolutional layers, the dimensions of $\mathcal{K}_1$ and $\mathcal{K}_2$ are reshaped before TR factorization. Following the ablation study, the fully-connected layer in the CNN is not compressed, and hence its weight matrix is not resized.
The results are shown in Table II, where different TR ranks are used to attain the 4.3x and 8.9x CRs. It is seen that the proposed NTRN attains higher accuracy than TRN at the same CR, and the standard deviation of NTRN is smaller than that of TRN. Note that the storage information is not provided since the fully-connected layer is not compressed.
IV-C LeNet-5 Evaluation
Table III: Compression results of LeNet-5 on Fashion MNIST.

| Method | CR | Acc (%) | Std | H (%) | L (%) | Storage |
|---|---|---|---|---|---|---|
| Original | 1x | 89.86 | 0.005 | 90.7 | 89.0 | 2.6 MB |
| TRN [17] | 13x | 88.10 | 0.005 | 89.0 | 87.1 | 0.2 MB |
| NTRN | 13x | 88.53 | 0.004 | 89.4 | 87.8 | 0.2 MB |
| TRN [17] | 72x | 87.91 | 0.006 | 88.7 | 87.0 | 0.04 MB |
| NTRN | 72x | 88.47 | 0.003 | 88.9 | 87.7 | 0.04 MB |
With promising results on single-layer-type compression, we conduct experiments using LeNet-5 [23] on the Fashion MNIST dataset to further evaluate the proposed NTRN, where LeNet-5 is made up of two convolutional and two fully-connected layers. The weight tensors from the input to the output layers are resized before compression. As shown in Table III, the proposed NTRN is superior to TRN at both the 13x and 72x CRs, where different TR ranks are adopted for the two settings.
IV-D VGG-11 Evaluation
Furthermore, the developed NTRN is examined with VGG-11 on two datasets, namely, Cifar-10 and Fashion MNIST. VGG-11 consists of eight convolutional and three fully-connected layers. For the Cifar-10 dataset, the weight tensors from the input to the output layers are tensorized before compression. Excluding the first weight tensor, whose dimensions differ for the Fashion MNIST dataset, the dimensions of the others are kept the same as those for Cifar-10. Different TR ranks are set to attain the 9x and 17x CRs.
Table IV: Compression results of VGG-11 on Cifar-10.

| Method | CR | Acc (%) | Std | H (%) | L (%) | Storage |
|---|---|---|---|---|---|---|
| Original | 1x | 80.40 | 0.005 | 81.0 | 79.4 | 36.6 MB |
| TRN [17] | 9x | 76.44 | 0.006 | 77.2 | 75.4 | 4.2 MB |
| NTRN | 9x | 77.87 | 0.002 | 78.2 | 77.6 | 4.2 MB |
| TRN [17] | 17x | 75.67 | 0.005 | 76.6 | 75.1 | 2.2 MB |
| NTRN | 17x | 77.50 | 0.007 | 78.5 | 76.3 | 2.2 MB |
Table IV lists the results on the Cifar-10 dataset. NTRN demonstrates clear superiority over TRN. Specifically, the average accuracy of the proposed NTRN is higher by 1.4% and 1.8% at the 9x and 17x CRs, respectively. The corresponding results for Fashion MNIST are tabulated in Table V. Compared with TRN, the proposed NTRN effectively mitigates the loss of accuracy caused by compression.
V Conclusion
In this paper, we have proposed a novel network compression technique, termed NTRN, where the weight arrays at fully-connected and convolutional layers are compressed in TR format. Different from the conventional TRN, a nonlinear activation function is added after the tensor contraction and convolution operations inside the compressed layer. The proposed NTRN achieves higher accuracy than the state-of-the-art TRN compression method. The superior performance of NTRN has been verified on image classification tasks with different DNN architectures, namely, MLP, LeNet-5 and VGG-11, on the MNIST, Fashion MNIST and Cifar-10 datasets. We believe that the proposed NTRN is well suited to embedded systems because of its ultra-low memory cost.
References
- [1] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, Jan. 2013.
- [2] Q. Liu and J. Wu, “Parameter tuning-free missing-feature reconstruction for robust sound recognition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 1, pp. 78–89, Jan. 2020.
- [3] Y. Wang, X. Song, and K. Chen, “Channel and space attention neural network for image denoising,” IEEE Signal Process. Lett., vol. 28, pp. 424–428, Feb. 2021.
- [4] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, “Exploiting feature and class relationships in video categorization with regularized deep neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2017.
- [5] Y. Guo, Z. Zhang, Y. Huang, and P. Zhang, “DOA estimation method based on cascaded neural network for two closely spaced sources,” IEEE Signal Process. Lett., vol. 27, pp. 570–574, Apr. 2020.
- [6] Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao, “Deep learning in bioinformatics: Introduction, application, and perspective in the big data era,” Methods, vol. 166, pp. 4–21, Aug. 2019.
- [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, Nevada, USA., Jun. 2016, pp. 770–778.
- [8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
- [9] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
- [10] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Vancouver, BC, Canada, Oct. 2013, pp. 6655–6659.
- [11] C. Tai, T. Xiao, Y. Zhang, and X. Wang, “Convolutional neural networks with low-rank regularization,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
- [12] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, Hawaii, USA, Feb. 2017, pp. 7370–7379.
- [13] J. Huang, W. Sun, and L. Huang, “Deep neural networks compression learning based on multiobjective evolutionary algorithms,” Neurocomputing, vol. 378, pp. 260–269, Feb. 2020.
- [14] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.
- [15] M. Astrid and S.-I. Lee, “CP-decomposition with tensor power method for convolutional neural networks compression,” in Proc. IEEE Int. Conf. Big Data Smart Comput., Jeju, South Korea, Feb. 2017, pp. 115–118.
- [16] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: Compressing convolutional and FC layers alike,” arXiv preprint arXiv:1611.03214, 2016.
- [17] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal, “Wide compression: Tensor ring nets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, USA, Jun. 2018, pp. 9329–9338.
- [18] Q. Zhao, M. Sugiyama, L. Yuan, and A. Cichocki, “Learning efficient tensor representations with ring-structured networks,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Brighton, UK, May 2019, pp. 8608–8612.
- [19] W. Sun, S. Chen, L. Huang, H. C. So, and M. Xie, “Deep convolutional neural network compression via coupled tensor decomposition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 603–616, Nov. 2020.
- [20] J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 1828–1837.
- [21] D. Wang, G. Zhao, H. Chen, Z. Liu, L. Deng, and G. Li, “Nonlinear tensor train format for deep neural network compression,” Neural Netw., vol. 144, pp. 320–333, Dec. 2021.
- [22] Y. Pan, J. Xu, M. Wang, J. Ye, F. Wang, K. Bai, and Z. Xu, “Compressing recurrent neural networks with tensor ring for action recognition,” in Proc. AAAI Conf. Artif. Intell., Honolulu, Hawaii, USA, Jul. 2019, pp. 4683–4690.
- [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
- [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015.
- [25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- [26] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
- [27] A. Cichocki, N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic, “Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1,” arXiv preprint arXiv:1609.00893, Sep. 2016.
- [28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, “Pytorch: An imperative style, high-performance deep learning library,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 8026–8037, 2019.