
Nonlinear Tensor Ring Network

Xiao Peng Li, Qi Liu, and Hing Cheung So

Xiao Peng Li and Hing Cheung So are with the Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China (e-mail: [email protected]; [email protected]). Qi Liu is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]). (Corresponding author: Qi Liu.)
Abstract

The state-of-the-art deep neural networks (DNNs) have been widely applied to various real-world applications and have achieved significant performance on cognitive problems. However, the growth of DNN width and depth results in a huge number of parameters, which challenges storage and memory cost and limits the use of DNNs on resource-constrained platforms such as portable devices. By converting redundant models into compact ones, compression techniques offer a practical way to reduce storage and memory consumption. In this paper, we develop a nonlinear tensor ring network (NTRN) in which both fully-connected and convolutional layers are compressed via tensor ring decomposition. Furthermore, to mitigate the accuracy loss caused by compression, a nonlinear activation function is embedded into the tensor contraction and convolution operations inside the compressed layer. Experimental results demonstrate the effectiveness and superiority of the proposed NTRN for image classification using two basic neural networks, LeNet-5 and VGG-11, on three datasets, viz. MNIST, Fashion MNIST and Cifar-10.

Index Terms:
Deep neural network, network compression, tensor decomposition, nonlinear tensor ring.

I Introduction

Recently, deep neural networks (DNNs) have attracted increasing attention due to their excellent performance in various fields, such as speech recognition [1, 2], image denoising [3], video categorization [4], object detection [5] and bioinformatics [6]. A pivotal characteristic of DNNs is their large depth, since increasing the number of layers improves representational ability and thus attains higher accuracy [7]. This depth, however, generates an enormous number of parameters that require huge storage and memory space, restricting DNN deployment on resource-constrained devices such as mobile phones, wearables and Internet-of-Things (IoT) devices. To that end, network compression techniques [8] have been proposed to compress DNNs.

In general, methods for neural network compression can be roughly classified into four categories, namely, parameter pruning and quantization, low-rank factorization, transferred/compact convolution filters, and knowledge distillation [9]. Compared with the others, low-rank factorization is easy to implement, supports both from-scratch and pre-trained neural networks, and follows a standardized pipeline. In this work, we therefore focus on low-rank factorization, which leverages the low-rank property of a matrix/tensor to decompose a weight array into multiple small-size factors and thereby reduce the number of parameters. Low-rank matrix decomposition was proposed to compress multi-layer perceptrons (MLPs) [10] and convolutional neural networks (CNNs) [11]. However, the compressed models inevitably suffer from performance degradation. To alleviate the loss of accuracy at high compression ratio (CR), [12] suggested combining low-rank and sparse decomposition to compress DNNs. This joint strategy factorizes a weight matrix of a pre-trained network into two small-size factor matrices and one sparse matrix, and then fine-tunes them to improve performance. In addition, neural network compression was formulated as a multi-objective optimization problem in [13], where CR and classification error rate were optimized simultaneously. Although this can achieve a satisfactory tradeoff between CR and classification error rate, compression artifacts may arise in low-rank matrix decomposition based approaches because they have to reshape the tensorial weights in convolutional layers into a matrix.

To avoid impairing the tensor structure, low-rank tensor decomposition has been utilized to compress DNNs, achieving higher CR and accuracy than the matrix-based counterparts. Early work compressed DNNs using Tucker decomposition [14], followed by CANDECOMP/PARAFAC (CP) decomposition [15] to reduce performance loss at high CR. Besides, tensor train (TT) [16] and tensor ring (TR) [17, 18] models have been utilized for neural network compression. They convert the weight array into a high-order tensor for a high CR, and then factorize this tensor into multiple core tensors. Specifically, TR achieved an 18-times (18x) CR with only 0.1% accuracy loss on the LeNet-5 model for the MNIST dataset. Furthermore, Sun et al. [19] leveraged the structure sharing feature in ISTA-Net [20] and ResNet [7] to achieve a high CR with small accuracy loss. Recently, a nonlinear TT (NTT) method [21] was developed that inserts a dynamic activation function between two adjacent core tensors to boost performance.

In this work, we aim to leverage low-rank tensor factorization to compress DNNs. It has been demonstrated that the TR format achieves higher accuracy than TT decomposition at the same CR [22]. Therefore, we adopt TR factorization to devise a nonlinear TR network (NTRN), in which both fully-connected and convolutional layers are compressed by TR factorization. Besides, a nonlinear activation function is added after the tensor contraction and convolution operations inside each compressed layer. Different from the NTT format, which unfolds the tensorial weights of convolutional layers into a matrix, the proposed NTRN operates directly on the tensorial weights, without distorting the spatial information in the convolution kernels. To verify the expressive ability of the proposed NTRN, we test it with two basic neural networks, LeNet-5 [23] and VGG-11 [24], on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. The experimental results demonstrate that our NTRN alleviates the accuracy loss compared with the TR-based method.

II Preliminaries

Consider an $N$th-order tensor $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$. TT decomposition represents $\boldsymbol{\mathcal{A}}$ by two matrices and $N-2$ core tensors, that is,

$$\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})=\sum_{r_{1},\cdots,r_{N-1}=1}^{R_{1},\cdots,R_{N-1}}\boldsymbol{G}_{1}(i_{1},r_{1})\boldsymbol{\mathcal{G}}_{2}(r_{1},i_{2},r_{2})\cdots\boldsymbol{\mathcal{G}}_{N-1}(r_{N-2},i_{N-1},r_{N-1})\boldsymbol{G}_{N}(r_{N-1},i_{N}) \quad (1)$$

where $\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})$ denotes the $(i_{1},i_{2},\cdots,i_{N})$ entry, $\boldsymbol{G}_{1}\in\mathbb{R}^{I_{1}\times R_{1}}$, $\boldsymbol{G}_{N}\in\mathbb{R}^{R_{N-1}\times I_{N}}$ and $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k-1}\times I_{k}\times R_{k}}$ with $k\in[2,N-1]$ are termed core tensors, and $[R_{1},R_{2},\cdots,R_{N-1}]$ are the TT ranks. By contrast, TR decomposes $\boldsymbol{\mathcal{A}}$ into $N$ 3rd-order tensors:

$$\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})=\sum_{r_{1},\cdots,r_{N}=1}^{R_{1},\cdots,R_{N}}\boldsymbol{\mathcal{G}}_{1}(r_{1},i_{1},r_{2})\boldsymbol{\mathcal{G}}_{2}(r_{2},i_{2},r_{3})\cdots\boldsymbol{\mathcal{G}}_{N-1}(r_{N-1},i_{N-1},r_{N})\boldsymbol{\mathcal{G}}_{N}(r_{N},i_{N},r_{1}) \quad (2)$$

where $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times I_{k}\times R_{k+1}}$ with $k\in[1,N]$ (and $R_{N+1}=R_{1}$), and $[R_{1},R_{2},\cdots,R_{N}]$ are the TR ranks. The main difference between the TT and TR factorizations lies in the first and last core arrays. Specifically, the TT format requires matrices in the first and last positions, while all core arrays in the TR factorization are 3rd-order tensors. Thereby, the TT decomposition might result in large intermediate cores and small boundary factors, which restricts its representational ability and flexibility. Besides, multiplying TT cores must obey a strict order, and hence the convolution kernel in convolutional layers is unfolded into a vector to perform the sequenced operation [16]. This motivates us to devise a nonlinear TR format for neural network compression.
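To make the TR format concrete, the following PyTorch sketch reconstructs a small 3rd-order tensor from its TR cores according to (2); the sizes and ranks are illustrative choices rather than values used in the paper.

```python
import torch

# Illustrative TR cores for a 3rd-order tensor of size 4 x 5 x 6 with
# TR ranks [R1, R2, R3] = [2, 3, 2]; core G_k has shape (R_k, I_k, R_{k+1}),
# and the last rank index wraps around to R_1, closing the ring.
ranks, dims = [2, 3, 2], [4, 5, 6]
G1 = torch.randn(ranks[0], dims[0], ranks[1])
G2 = torch.randn(ranks[1], dims[1], ranks[2])
G3 = torch.randn(ranks[2], dims[2], ranks[0])

# Eq. (2): A(i, j, k) = sum_{r1, r2, r3} G1(r1, i, r2) G2(r2, j, r3) G3(r3, k, r1)
A = torch.einsum('aib,bjc,cka->ijk', G1, G2, G3)
print(A.shape)   # torch.Size([4, 5, 6])
```

For a general $N$, the cores are contracted one after another and the two remaining boundary rank indices are traced out at the end; in the TT format the boundary factors are matrices, so no such closing trace appears.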

III Nonlinear Tensor Ring Network

In this section, the proposed NTRN for compressing fully-connected and convolutional layers is introduced in detail.

III-A Fully-Connected Layer Decomposition

A fully-connected layer maps an input vector $\boldsymbol{x}\in\mathbb{R}^{I}$ to an output vector $\boldsymbol{y}\in\mathbb{R}^{O}$ via a weight matrix $\boldsymbol{W}\in\mathbb{R}^{I\times O}$. Mathematically, it is formulated as

$$\boldsymbol{y}=\boldsymbol{W}^{T}\boldsymbol{x} \quad (3)$$

where $(\cdot)^{T}$ denotes the transpose operator. Neural network compression by low-rank matrix/tensor factorization decomposes $\boldsymbol{W}$ into small-size factors. To compress $\boldsymbol{W}$ in tensor structure, we first reshape $\boldsymbol{W}$ into a high-order tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}\times O_{1}\times\cdots\times O_{M}}$ with

$$\prod_{n=1}^{N}I_{n}=I,\quad\prod_{m=1}^{M}O_{m}=O. \quad (4)$$

Then, based on the TR format, $\boldsymbol{\mathcal{W}}$ is factorized into $(N+M)$ 3rd-order tensors:

$$\boldsymbol{\mathcal{W}}=\sum_{r_{1},\cdots,r_{M+N}=1}^{R_{1},\cdots,R_{M+N}}\boldsymbol{\mathcal{G}}_{1}(r_{1},:,r_{2})\circ\boldsymbol{\mathcal{G}}_{2}(r_{2},:,r_{3})\circ\cdots\circ\boldsymbol{\mathcal{G}}_{M+N}(r_{M+N},:,r_{1}) \quad (5)$$

where $\circ$ denotes the outer product, $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times I_{k}\times R_{k+1}}$ for $k\in[1,N]$, $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times O_{k-N}\times R_{k+1}}$ for $k\in[N+1,N+M]$, and $\boldsymbol{\mathcal{G}}_{k}(r_{k},:,r_{k+1})$ is the corresponding vertical fiber of $\boldsymbol{\mathcal{G}}_{k}$. Moreover, $\boldsymbol{x}$ and $\boldsymbol{y}$ are tensorized as $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$ and $\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{O_{1}\times\cdots\times O_{M}}$, respectively. Thereby, (3) can be rewritten in the tensor format:

$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{X}}\times_{1}^{2}\boldsymbol{\mathcal{G}}_{1}\times_{1,N+1}^{2,1}\boldsymbol{\mathcal{G}}_{2}\cdots\times_{1,3}^{2,1}\boldsymbol{\mathcal{G}}_{N}\times_{2}^{1}\boldsymbol{\mathcal{G}}_{N+1}\cdots\times_{M-1}^{1}\boldsymbol{\mathcal{G}}_{N+M-1}\times_{1,M+1}^{3,1}\boldsymbol{\mathcal{G}}_{N+M} \quad (6)$$

where $\times_{k}^{1}$ and $\times_{k,l}^{1,3}$ are tensor contraction operations [27]. Consider $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k}\times I_{k+1}\times\cdots\times I_{l-1}\times I_{l}\times I_{l+1}\times\cdots\times I_{N}}$ and $\boldsymbol{\mathcal{B}}\in\mathbb{R}^{J_{1}\times J_{2}\times J_{3}}$ with $I_{k}=J_{1}$ and $I_{l}=J_{3}$. Then $\boldsymbol{\mathcal{A}}\times_{k}^{1}\boldsymbol{\mathcal{B}}$ yields an $(N+1)$th-order tensor $\boldsymbol{\mathcal{C}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k+1}\times\cdots\times I_{N}\times J_{2}\times J_{3}}$ whose entries are calculated by

$$\boldsymbol{\mathcal{C}}(i_{1},\cdots,i_{k-1},i_{k+1},\cdots,i_{N},j_{2},j_{3})=\sum_{i_{k}=1}^{I_{k}}\boldsymbol{\mathcal{A}}(i_{1},\cdots,i_{k},\cdots,i_{N})\boldsymbol{\mathcal{B}}(i_{k},j_{2},j_{3}). \quad (7)$$

Similarly, $\boldsymbol{\mathcal{A}}\times_{k,l}^{1,3}\boldsymbol{\mathcal{B}}$ generates an $(N-1)$th-order tensor $\boldsymbol{\mathcal{D}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k+1}\times\cdots\times I_{l-1}\times I_{l+1}\times\cdots\times I_{N}\times J_{2}}$ with entries

$$\boldsymbol{\mathcal{D}}(i_{1},\cdots,i_{k-1},i_{k+1},\cdots,i_{l-1},i_{l+1},\cdots,i_{N},j_{2})=\sum_{i_{k},i_{l}=1}^{I_{k},I_{l}}\boldsymbol{\mathcal{A}}(i_{1},\cdots,i_{k},\cdots,i_{l},\cdots,i_{N})\boldsymbol{\mathcal{B}}(i_{k},j_{2},i_{l}). \quad (8)$$
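To make the two contraction operators concrete, the short PyTorch snippet below realizes (7) and (8) with torch.einsum for a 4th-order $\boldsymbol{\mathcal{A}}$ and a 3rd-order $\boldsymbol{\mathcal{B}}$; the tensor sizes are arbitrary and serve only to check the resulting shapes.

```python
import torch

# Eq. (7): A x_3^1 B contracts mode 3 of A with mode 1 of B.
A = torch.randn(3, 4, 5, 6)            # I1 x I2 x I3 x I4
B = torch.randn(5, 7, 2)               # J1 x J2 x J3 with J1 = I3
C = torch.einsum('abkd,kmn->abdmn', A, B)
print(C.shape)                         # torch.Size([3, 4, 6, 7, 2])

# Eq. (8): A x_{2,4}^{1,3} B contracts mode 2 of A with mode 1 of B
# and mode 4 of A with mode 3 of B (so J1 = I2 and J3 = I4 must hold).
A = torch.randn(3, 5, 6, 2)            # I1 x I2 x I3 x I4
B = torch.randn(5, 7, 2)               # J1 x J2 x J3
D = torch.einsum('akbl,kml->abm', A, B)
print(D.shape)                         # torch.Size([3, 6, 7])
```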

It is worth noting that (6) is the same as the fully-connected TR layer in [17]. To boost the accuracy, we propose to include a nonlinear activation function after each tensor contraction in (6), resulting in

$$\boldsymbol{\mathcal{Y}}=f(\cdots f(f(\cdots f(\boldsymbol{\mathcal{X}}\times_{1}^{2}\boldsymbol{\mathcal{G}}_{1})\cdots\times_{1,3}^{2,1}\boldsymbol{\mathcal{G}}_{N})\times_{2}^{1}\boldsymbol{\mathcal{G}}_{N+1})\cdots\times_{M-1}^{1}\boldsymbol{\mathcal{G}}_{N+M-1})\times_{1,M+1}^{3,1}\boldsymbol{\mathcal{G}}_{N+M} \quad (9)$$

where $f(\cdot)$ denotes the nonlinear activation function, e.g., Tanh in Fig. 1. Note that $f(\cdot)$ is not applied after $\boldsymbol{\mathcal{G}}_{N+M}$, since the layer's own activation function outside the compressed layer takes its place. Fig. 1 illustrates the fully-connected layer in NTRN.
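A minimal functional sketch of the nonlinear TR fully-connected layer (9) is given below for the case $N=M=2$, so that the contraction pattern of (6) can be written out explicitly with torch.einsum. The function name, core shapes and rank values are illustrative, and $f$ is taken as Tanh as in Fig. 1.

```python
import torch

def ntrn_fc_forward(x, cores, f=torch.tanh):
    """Nonlinear TR fully-connected layer, Eq. (9), for N = M = 2.

    x     : input of length I = I1 * I2, tensorized to (I1, I2)
    cores : [G1, G2, G3, G4] with shapes
            (R1, I1, R2), (R2, I2, R3), (R3, O1, R4), (R4, O2, R1)
    """
    G1, G2, G3, G4 = cores
    X = x.reshape(G1.shape[1], G2.shape[1])         # (I1, I2)
    T = f(torch.einsum('ij,aip->jap', X, G1))       # X x_1^2 G1        -> (I2, R1, R2)
    T = f(torch.einsum('jap,pjq->aq', T, G2))       # x_{1,3}^{2,1} G2  -> (R1, R3)
    T = f(torch.einsum('aq,qor->aor', T, G3))       # x_2^1 G3          -> (R1, O1, R4)
    Y = torch.einsum('aor,rpa->op', T, G4)          # x_{1,3}^{3,1} G4  -> (O1, O2), no f here
    return Y.reshape(-1)                            # the layer's own activation follows outside

# Example with illustrative sizes: I = 6 * 8, O = 4 * 5, all TR ranks equal to 3
cores = [torch.randn(3, 6, 3), torch.randn(3, 8, 3),
         torch.randn(3, 4, 3), torch.randn(3, 5, 3)]
print(ntrn_fc_forward(torch.randn(48), cores).shape)   # torch.Size([20])
```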

Figure 1: Illustrations of fully-connected layer in conventional neural network and nonlinear tensor ring network.

III-B Convolutional Layer Decomposition

At a convolutional layer, an input tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{H\times W\times I}$ is convolved with a 4th-order kernel tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times D\times I\times O}$ to produce an output tensor $\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{\widetilde{H}\times\widetilde{W}\times O}$, formulated as

$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{W}}*\boldsymbol{\mathcal{X}} \quad (10)$$

where $*$ denotes the convolution operation in CNNs. To compress $\boldsymbol{\mathcal{W}}$, TR decomposition factorizes it into three core tensors, viz. $\boldsymbol{\mathcal{V}}\in\mathbb{R}^{R_{1}\times D\times D\times R_{2}}$, $\boldsymbol{\mathcal{U}}\in\mathbb{R}^{R_{2}\times I\times R_{3}}$ and $\widetilde{\boldsymbol{\mathcal{U}}}\in\mathbb{R}^{R_{3}\times O\times R_{1}}$. Therefore, $\boldsymbol{\mathcal{W}}$ is computed by

$$\boldsymbol{\mathcal{W}}=\sum_{r_{1},r_{2},r_{3}=1}^{R_{1},R_{2},R_{3}}\boldsymbol{\mathcal{V}}(r_{1},:,:,r_{2})\circ\boldsymbol{\mathcal{U}}(r_{2},:,r_{3})\circ\widetilde{\boldsymbol{\mathcal{U}}}(r_{3},:,r_{1}). \quad (11)$$

Note that $\boldsymbol{\mathcal{V}}$ retains the convolution kernel, which is different from the NTT format. In NTRN, the convolution operation (10) is decomposed into the following procedure:

$$\boldsymbol{\mathcal{Z}}_{1}=f(\boldsymbol{\mathcal{X}}\times_{3}^{2}\boldsymbol{\mathcal{U}}) \quad (12)$$
$$\boldsymbol{\mathcal{Z}}_{2}=f(\boldsymbol{\mathcal{V}}^{p_{1}}*\boldsymbol{\mathcal{Z}}_{1}) \quad (13)$$
$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{Z}}_{2}\times_{1,3}^{3,1}\widetilde{\boldsymbol{\mathcal{U}}} \quad (14)$$

where $\boldsymbol{\mathcal{Z}}_{1}\in\mathbb{R}^{H\times W\times R_{2}\times R_{3}}$, $\boldsymbol{\mathcal{Z}}_{2}\in\mathbb{R}^{\widetilde{H}\times\widetilde{W}\times R_{1}\times R_{3}}$, and $(\cdot)^{p_{1}}$ denotes the first tensor permutation, which relates $\boldsymbol{\mathcal{V}}^{p_{1}}$ to $\boldsymbol{\mathcal{V}}$ by

$$\boldsymbol{\mathcal{V}}^{p_{1}}(d_{1},d_{2},\alpha_{2},\alpha_{1})=\boldsymbol{\mathcal{V}}(\alpha_{1},d_{1},d_{2},\alpha_{2}). \quad (15)$$

For the convolutional layer in NTRN, the input tensor $\boldsymbol{\mathcal{X}}$ is first contracted with the intermediate factor $\boldsymbol{\mathcal{U}}$, then convolved with the first core $\boldsymbol{\mathcal{V}}$, and finally contracted with the last core $\widetilde{\boldsymbol{\mathcal{U}}}$. It is worth mentioning that this procedure is entirely different from the operation in the NTT format. On the other hand, when $I$ and $O$ are large, $\boldsymbol{\mathcal{U}}$ and $\widetilde{\boldsymbol{\mathcal{U}}}$ can be further factorized into small-size core tensors to achieve a higher CR. For instance, $\boldsymbol{\mathcal{U}}$ can be reshaped as $\widehat{\boldsymbol{\mathcal{U}}}\in\mathbb{R}^{R_{2}\times I_{1}\times I_{2}\times I_{3}\times R_{3}}$ with $I_{1}I_{2}I_{3}=I$, and then factorized into three tensors, namely, $\boldsymbol{\mathcal{G}}_{1}\in\mathbb{R}^{R_{2}\times I_{1}\times R_{22}}$, $\boldsymbol{\mathcal{G}}_{2}\in\mathbb{R}^{R_{22}\times I_{2}\times R_{23}}$ and $\boldsymbol{\mathcal{G}}_{3}\in\mathbb{R}^{R_{23}\times I_{3}\times R_{3}}$. To compute $\widehat{\boldsymbol{\mathcal{U}}}$ from its core tensors, the proposed NTRN adopts the nonlinear contraction operation

$$\widehat{\boldsymbol{\mathcal{U}}}=f(f(\boldsymbol{\mathcal{G}}_{1}\times_{3}^{1}\boldsymbol{\mathcal{G}}_{2})\times_{4}^{1}\boldsymbol{\mathcal{G}}_{3}). \quad (16)$$

The convolutional layer in NTRN is illustrated in Fig. 2.
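The procedure (12)-(15) can be sketched in PyTorch as follows. This is one possible realization under the assumption $f=$ Tanh: the rank mode $R_{3}$ is mapped to the batch dimension of conv2d, $R_{2}$ and $R_{1}$ to its input and output channels, and the permutation of $\boldsymbol{\mathcal{V}}$ plays the role of $\boldsymbol{\mathcal{V}}^{p_{1}}$ arranged in conv2d's weight layout. All names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def ntrn_conv_forward(X, V, U, U_tilde, f=torch.tanh):
    """Nonlinear TR convolutional layer, Eqs. (12)-(14).

    X       : input feature map,   shape (H, W, I)
    V       : spatial core,        shape (R1, D, D, R2)
    U       : input-channel core,  shape (R2, I, R3)
    U_tilde : output-channel core, shape (R3, O, R1)
    """
    # Eq. (12): contract the channel mode of X with U, then apply f.
    Z1 = f(torch.einsum('hwi,aib->hwab', X, U))          # (H, W, R2, R3)

    # Eq. (13): 2-D convolution with the permuted spatial core V^{p1};
    # R3 acts as a batch dimension, R2 -> R1 as input/output channels.
    z = Z1.permute(3, 2, 0, 1)                           # (R3, R2, H, W)
    w = V.permute(0, 3, 1, 2)                            # (R1, R2, D, D)
    Z2 = f(F.conv2d(z, w))                               # (R3, R1, H~, W~)
    Z2 = Z2.permute(2, 3, 1, 0)                          # (H~, W~, R1, R3)

    # Eq. (14): contract both rank modes with U_tilde; no f here.
    return torch.einsum('hwab,boa->hwo', Z2, U_tilde)    # (H~, W~, O)

# Illustrative sizes: 3x3 kernel, I = 16 input and O = 32 output channels, all ranks 4
X = torch.randn(28, 28, 16)
V, U, U_t = torch.randn(4, 3, 3, 4), torch.randn(4, 16, 4), torch.randn(4, 32, 4)
print(ntrn_conv_forward(X, V, U, U_t).shape)             # torch.Size([26, 26, 32])
```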

Figure 2: Illustrations of convolutional layer in conventional neural network and nonlinear tensor ring network.

III-C Compression Ratio

CR is defined as:

$$\text{CR}=\frac{\#\ \text{parameters of original network}}{\#\ \text{parameters of compressed network}}. \quad (17)$$

For a fully-connected layer with weight matrix $\boldsymbol{W}\in\mathbb{R}^{I\times O}$, the number of parameters is $IO$. The proposed NTRN requires $(\sum_{n=1}^{N}I_{n}+\sum_{m=1}^{M}O_{m})R^{2}$ parameters under the assumption $R_{1}=\cdots=R_{M+N}=R$. On the other hand, the convolutional layer of a conventional neural network needs to store $D^{2}IO$ parameters for $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times D\times I\times O}$. In NTRN, the dimensions of $\boldsymbol{\mathcal{W}}$ are reshaped into $D\times D\times I_{1}\times\cdots\times I_{N}\times O_{1}\times\cdots\times O_{M}$ for a high CR. Hence, only $(\sum_{n=1}^{N}I_{n}+\sum_{m=1}^{M}O_{m}+D^{2})R^{2}$ parameters are required. Clearly, decreasing $R$ increases the CR.
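As a worked example of (17), the snippet below counts the parameters of one TR-compressed fully-connected layer, using the reshape and rank applied to $\boldsymbol{W}_{1}\in\mathbb{R}^{784\times 1024}$ in Section IV-A; bias terms are not counted.

```python
# W1 in R^{784 x 1024} is reshaped to modes 4 x 7 x 4 x 7 (input) and
# 4 x 8 x 4 x 8 (output), and all TR ranks are set to R = 16.
in_modes, out_modes, R = [4, 7, 4, 7], [4, 8, 4, 8], 16

original = 784 * 1024                                   # I * O
compressed = (sum(in_modes) + sum(out_modes)) * R**2    # (sum I_n + sum O_m) * R^2
print(original, compressed, original / compressed)      # 802816 11776 ~68.2
```

Note that this is a layer-level ratio; the network-level CRs reported in Section IV aggregate all compressed layers.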

IV Experimental Results

In this section, the proposed NTRN is evaluated with a basic MLP, a basic CNN, LeNet-5 [23] and VGG-11 [24] on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. All networks are implemented in the PyTorch framework [28]. The experiments for the MLP, CNN and LeNet-5 are run on an Nvidia GTX 2060 GPU, while VGG-11 is run on an Nvidia GTX 2080Ti GPU. In our experiments, the TR ranks within an individual layer are the same, but can differ across layers. Unless otherwise specified, the activation function outside the compressed layers is ReLU and the pooling function in CNNs is max pooling (size = 2, stride = 2).

IV-A Fully-Connected Layer Evaluation

TABLE I: MLP compression on MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 98.15 0.001 98.3 98.0 5.1 MB
TRN [17] 57x 97.34 0.003 97.7 97.0 0.1 MB
NTRN 57x 97.72 0.002 97.9 97.4 0.1 MB
TRN [17] 359x 95.94 0.004 96.3 95.2 0.02 MB
NTRN 359x 96.51 0.002 96.7 96.3 0.02 MB

We first test the proposed NTRN using an MLP on the MNIST dataset. The MLP consists of an input layer, two hidden layers and an output layer with 784, 1024, 512 and 10 nodes, respectively. All images in MNIST are reshaped into vectors of length 784. Therefore, the weight matrices of the MLP are $\boldsymbol{W}_{1}\in\mathbb{R}^{784\times 1024}$, $\boldsymbol{W}_{2}\in\mathbb{R}^{1024\times 512}$ and $\boldsymbol{W}_{3}\in\mathbb{R}^{512\times 10}$. To compress them, $\boldsymbol{W}_{1}$, $\boldsymbol{W}_{2}$ and $\boldsymbol{W}_{3}$ are tensorized to dimensions $4\times 7\times 4\times 7\times 4\times 8\times 4\times 8$, $4\times 8\times 4\times 8\times 8\times 8\times 8$ and $8\times 8\times 8\times 10$, respectively. The number of training epochs is set to 50. The experimental results over 20 independent trials are tabulated in Table I, where Acc and Std denote the average accuracy on the test set and the standard deviation over all trials, respectively. In addition, the highest and lowest accuracy among all trials are listed in columns H and L, respectively. Moreover, Storage indicates the disk space occupied by the parameters. At 57x CR, the ranks of the three layers are $\{16,14,8\}$, while the ranks are set to $\{6,5,5\}$ at 359x CR. We can see that the proposed NTRN outperforms TRN [17] at both low and high CRs in terms of average accuracy, standard deviation, and highest and lowest accuracy.
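To connect the 57x figure with the parameter-count formula of Section III-C, the sketch below tallies the original and compressed parameters of the three tensorized layers under the stated ranks $\{16,14,8\}$ (bias terms ignored); the resulting ratio of about 57 is consistent with Table I.

```python
# Verify the 57x CR of the compressed MLP (Table I) from the layer
# shapes and ranks given in the text; bias terms are not counted.
layers = [
    # (input modes,  output modes, rank R)
    ([4, 7, 4, 7],   [4, 8, 4, 8], 16),   # W1: 784 x 1024
    ([4, 8, 4, 8],   [8, 8, 8],    14),   # W2: 1024 x 512
    ([8, 8, 8],      [10],          8),   # W3: 512 x 10
]

orig = comp = 0
for in_modes, out_modes, R in layers:
    size_in = size_out = 1
    for d in in_modes:
        size_in *= d
    for d in out_modes:
        size_out *= d
    orig += size_in * size_out                       # I * O
    comp += (sum(in_modes) + sum(out_modes)) * R**2  # (sum I_n + sum O_m) * R^2

print(orig, comp, round(orig / comp, 1))   # 1332224 23360 57.0
```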

IV-B Convolutional Layer Evaluation

We now investigate the compression performance with a CNN consisting of two convolutional layers and one fully-connected layer on MNIST. The first and second convolutional layers contain 16 and 32 kernels of size 3, respectively. Moreover, the stride is set to 1 and the padding to 0. Since the dimensions of the input data are $28\times 28\times 1$, the weight tensor of the first convolutional layer is $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 1\times 16}$. Besides, the weight arrays of the second convolutional layer and the fully-connected layer are $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{3\times 3\times 16\times 32}$ and $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{320\times 10}$, respectively. To compress the weight arrays of the convolutional layers, the dimensions of $\boldsymbol{\mathcal{W}}_{1}$ and $\boldsymbol{\mathcal{W}}_{2}$ are reshaped as $3\times 3\times 1\times 4\times 4$ and $3\times 3\times 4\times 4\times 4\times 8$, respectively. In this ablation study, the fully-connected layer of the CNN is not compressed, and hence its weight matrix is not resized.

The results are shown in Table II. The ranks at 4.3x and 8.9x CRs are $\{2,6\}$ and $\{2,4\}$, respectively. It is seen that the proposed NTRN attains higher accuracy than TRN at the same CR, while the standard deviation of NTRN is smaller than that of TRN. Note that the storage information is not provided since the fully-connected layer is not compressed.

TABLE II: CNN compression on MNIST dataset.
Method CR Acc(%) Std H(%) L(%)
Original 1x 98.71 0.001 98.9 98.5
TRN [17] 4.3x 98.35 0.003 98.6 98.1
NTRN 4.3x 98.60 0.001 98.9 98.4
TRN [17] 8.9x 97.86 0.003 98.3 97.2
NTRN 8.9x 98.15 0.002 98.4 97.8

IV-C LeNet-5 Evaluation

TABLE III: LeNet-5 compression on Fashion MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 89.86 0.005 90.7 89.0 2.6 MB
TRN [17] 13x 88.10 0.005 89.0 87.1 0.2 MB
NTRN 13x 88.53 0.004 89.4 87.8 0.2 MB
TRN [17] 72x 87.91 0.006 88.7 87.0 0.04 MB
NTRN 72x 88.47 0.003 88.9 87.7 0.04 MB

With promising results on compressing each layer type individually, we conduct experiments using LeNet-5 [23] on the Fashion MNIST dataset to further evaluate the proposed NTRN, where LeNet-5 consists of two convolutional and two fully-connected layers. The resized weight tensors from input to output are $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{5\times 5\times 1\times 4\times 5}$, $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{5\times 5\times 4\times 5\times 5\times 10}$, $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{5\times 10\times 5\times 5\times 8\times 8\times 8}$ and $\boldsymbol{\mathcal{W}}_{4}\in\mathbb{R}^{8\times 8\times 8\times 10}$. As shown in Table III, the proposed NTRN is superior to TRN at both 13x and 72x CRs. Herein, the ranks are $\{3,10,30,8\}$ for 13x CR and $\{3,8,10,5\}$ for 72x CR.

IV-D VGG-11 Evaluation

Furthermore, the developed NTRN is examined with VGG-11 on two datasets, namely Cifar-10 and Fashion MNIST. VGG-11 consists of eight convolutional and three fully-connected layers. For the Cifar-10 dataset, the reshaped weight tensors from input to output are $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 3\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 2\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{3\times 3\times 2\times 4\times 4\times 4\times 4\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{4}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{5}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{6}=\boldsymbol{\mathcal{W}}_{7}=\boldsymbol{\mathcal{W}}_{8}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{9}\in\mathbb{R}^{4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{10}\in\mathbb{R}^{4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4}$ and $\boldsymbol{\mathcal{W}}_{11}\in\mathbb{R}^{4\times 4\times 4\times 4\times 10}$, respectively. For the Fashion MNIST dataset, only the first weight tensor changes, to $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 1\times 4\times 4\times 4}$; the dimensions of the other layers are the same as those for Cifar-10. The ranks are set to $\{8,25,40,50,70,70,70,70,15,15,5\}$ for 9x CR, while they are $\{8,25,30,40,50,50,50,50,10,10,5\}$ at 17x CR.

TABLE IV: VGG-11 compression on Cifar-10 dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 80.40 0.005 81.0 79.4 36.6 MB
TRN [17] 9x 76.44 0.006 77.2 75.4 4.2 MB
NTRN 9x 77.87 0.002 78.2 77.6 4.2 MB
TRN [17] 17x 75.67 0.005 76.6 75.1 2.2 MB
NTRN 17x 77.50 0.007 78.5 76.3 2.2 MB

Table IV lists the results on the Cifar-10 dataset. NTRN demonstrates a clear superiority over TRN: specifically, the average accuracy of the proposed NTRN is higher by 1.4% and 1.8% at 9x and 17x CRs, respectively. The corresponding results for Fashion MNIST are tabulated in Table V. Compared with TRN, the proposed NTRN effectively mitigates the loss of accuracy caused by compression.

TABLE V: VGG-11 compression on Fashion MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 92.93 0.002 93.2 92.7 36.7 MB
TRN [17] 9x 91.81 0.002 92.0 91.6 4.2 MB
NTRN 9x 92.02 0.002 92.3 91.7 4.2 MB
TRN [17] 17x 91.41 0.003 91.7 90.8 2.3 MB
NTRN 17x 91.79 0.003 92.2 91.2 2.3 MB

V Conclusion

In this paper, we have proposed a novel network compression technique, termed NTRN, in which the weight arrays of fully-connected and convolutional layers are compressed in TR format. Different from the conventional TRN, a nonlinear activation function is added after the tensor contraction and convolution operations inside each compressed layer. The proposed NTRN achieves higher accuracy than the state-of-the-art TRN compression method. Its superior performance has been verified on the image classification task with different DNN architectures, namely MLP, CNN, LeNet-5 and VGG-11, on the MNIST, Fashion MNIST and Cifar-10 datasets. We believe that the proposed NTRN is a promising candidate for embedded systems because of its effectiveness in achieving ultra-low memory cost.

References

  • [1] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, Jan. 2013.
  • [2] Q. Liu and J. Wu, “Parameter tuning-free missing-feature reconstruction for robust sound recognition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 1, pp. 78–89, Jan. 2020.
  • [3] Y. Wang, X. Song, and K. Chen, “Channel and space attention neural network for image denoising,” IEEE Signal Process. Lett., vol. 28, pp. 424–428, Feb. 2021.
  • [4] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, “Exploiting feature and class relationships in video categorization with regularized deep neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2017.
  • [5] Y. Guo, Z. Zhang, Y. Huang, and P. Zhang, “DOA estimation method based on cascaded neural network for two closely spaced sources,” IEEE Signal Process. Lett., vol. 27, pp. 570–574, Apr. 2020.
  • [6] Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao, “Deep learning in bioinformatics: Introduction, application, and perspective in the big data era,” Methods, vol. 166, pp. 4–21, Aug. 2019.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, Nevada, USA., Jun. 2016, pp. 770–778.
  • [8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
  • [9] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
  • [10] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Vancouver, BC, Canada, Oct. 2013, pp. 6655–6659.
  • [11] C. Tai, T. Xiao, Y. Zhang, and X. Wang, “Convolutional neural networks with low-rank regularization,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
  • [12] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, Hawaii, USA, Feb. 2017, pp. 7370–7379.
  • [13] J. Huang, W. Sun, and L. Huang, “Deep neural networks compression learning based on multiobjective evolutionary algorithms,” Neurocomputing, vol. 378, pp. 260–269, Feb. 2020.
  • [14] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.
  • [15] M. Astrid and S.-I. Lee, “CP-decomposition with tensor power method for convolutional neural networks compression,” in Proc. IEEE Int. Conf. Big Data Smart Comput., Jeju, South Korea, Feb. 2017, pp. 115–118.
  • [16] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: Compressing convolutional and FC layers alike,” arXiv preprint arXiv:1611.03214, 2016.
  • [17] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal, “Wide compression: Tensor ring nets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, USA, Jun. 2018, pp. 9329–9338.
  • [18] Q. Zhao, M. Sugiyama, L. Yuan, and A. Cichocki, “Learning efficient tensor representations with ring-structured networks,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Brighton, UK, May 2019, pp. 8608–8612.
  • [19] W. Sun, S. Chen, L. Huang, H. C. So, and M. Xie, “Deep convolutional neural network compression via coupled tensor decomposition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 603–616, Nov. 2020.
  • [20] J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 1828–1837.
  • [21] D. Wang, G. Zhao, H. Chen, Z. Liu, L. Deng, and G. Li, “Nonlinear tensor train format for deep neural network compression,” Neural Netw., vol. 144, pp. 320–333, Dec. 2021.
  • [22] Y. Pan, J. Xu, M. Wang, J. Ye, F. Wang, K. Bai, and Z. Xu, “Compressing recurrent neural networks with tensor ring for action recognition,” in Proc. AAAI Conf. Artif. Intell., Honolulu, Hawaii, USA, Jul. 2019, pp. 4683–4690.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015.
  • [25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [26] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [27] A. Cichocki, N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic, “Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1,” arXiv preprint arXiv:1609.00893, Sep. 2016.
  • [28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, “Pytorch: An imperative style, high-performance deep learning library,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 8026–8037, 2019.