
Nonlinear Tensor Ring Network

Xiao Peng Li, Qi Liu, and Hing Cheung So

Xiao Peng Li and Hing Cheung So are with the Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China (e-mail: [email protected]; [email protected]). Qi Liu is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]). (Corresponding author: Qi Liu.)
Abstract

The state-of-the-art deep neural networks (DNNs) have been widely applied to various real-world applications and have achieved significant performance on cognitive problems. However, the growth of DNN width and depth results in a huge number of parameters, which challenges storage and memory cost and limits the use of DNNs on resource-constrained platforms such as portable devices. By converting redundant models into compact ones, compression techniques offer a practical way to reduce storage and memory consumption. In this paper, we develop a nonlinear tensor ring network (NTRN) in which both fully-connected and convolutional layers are compressed via tensor ring decomposition. Furthermore, to mitigate the accuracy loss caused by compression, a nonlinear activation function is embedded into the tensor contraction and convolution operations inside the compressed layer. Experimental results demonstrate the effectiveness and superiority of the proposed NTRN for image classification using two basic neural networks, LeNet-5 and VGG-11, on three datasets, viz. MNIST, Fashion MNIST and Cifar-10.

Index Terms:
Deep neural network, network compression, tensor decomposition, nonlinear tensor ring.

I Introduction

Recently, deep neural networks (DNNs) have attracted increasing attention due to their excellent performance in various fields, such as speech recognition [1, 2], image denoising [3], video categorization [4], object detection [5] and bioinformatics [6]. A pivotal characteristic of DNNs is their large depth, since increasing the number of layers improves representational ability and thus attains higher accuracy [7]. This depth, however, generates an enormous number of parameters that require huge storage and memory space, restricting DNN deployment on resource-constrained devices such as mobile phones, wearables and Internet-of-Things (IoT) devices. To that end, network compression techniques [8] have been proposed to compress DNNs.

In general, methods for neural network compression can be roughly classified into four categories, namely, parameter pruning and quantization, low-rank factorization, transferred/compact convolution filters, and knowledge distillation [9]. Compared with the others, low-rank factorization is easy to implement, supports both from-scratch and pre-trained neural networks, and follows a standardized pipeline. In this work, we therefore focus on low-rank factorization, which leverages the low-rank property of a matrix/tensor to decompose a weight array into multiple small-size factors and thereby reduce the number of parameters. Low-rank matrix decomposition was proposed to compress multi-layer perceptrons (MLPs) [10] and convolutional neural networks (CNNs) [11]. However, the compressed models inevitably suffer from performance degradation. To alleviate the loss of accuracy at high compression ratio (CR), [12] suggested combining low-rank and sparse decomposition to compress DNNs. This joint strategy factorizes a weight matrix of a pre-trained network into two small-size factor matrices and one sparse matrix, and then fine-tunes them to improve performance. In addition, neural network compression was formulated as a multi-objective optimization problem in [13], where CR and classification error rate were optimized simultaneously. Although this can achieve a satisfactory tradeoff between CR and classification error rate, compression artifacts may arise in low-rank matrix decomposition based approaches because they have to reshape the tensorial weights in convolutional layers into a matrix.

To avoid impairing the tensor structure, low-rank tensor decomposition has been utilized to compress DNNs, achieving higher CR and accuracy than the matrix-based counterparts. Early work compressed DNNs using Tucker decomposition [14], followed by CANDECOMP/PARAFAC (CP) decomposition [15] to reduce performance loss at high CR. Besides, tensor train (TT) [16] and tensor ring (TR) [17, 18] models have been utilized for neural network compression. They convert the weight array into a high-order tensor for a high CR, and then factorize this tensor into multiple core tensors. Specifically, TR achieved an 18-times (18x) CR with only 0.1% accuracy loss on the LeNet-5 model for the MNIST dataset. Furthermore, Sun et al. [19] leveraged the structure sharing feature in ISTA-Net [20] and ResNet [7] to achieve a high CR with small accuracy loss. Recently, a nonlinear TT (NTT) method [21] was developed that inserts a dynamic activation function between two adjacent core tensors to boost performance.

In this work, we aim to leverage low-rank tensor factorization to compress DNNs. It has been demonstrated that the TR format achieves higher accuracy than TT decomposition at the same CR [22]. Therefore, we adopt TR factorization to devise a nonlinear TR network (NTRN), in which both fully-connected and convolutional layers are compressed by TR factorization. Besides, a nonlinear activation function is added after the tensor contraction and convolution operations inside each compressed layer. Different from the NTT format, which unfolds the tensorial weights of convolutional layers into a matrix, the proposed NTRN operates directly on the tensorial weights, without distorting the spatial information in the convolution kernels. To verify the expressive ability of the proposed NTRN, we test it with two basic neural networks, LeNet-5 [23] and VGG-11 [24], on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. The experimental results demonstrate that our NTRN alleviates the accuracy loss compared with the TR-based method.

II Preliminaries

Consider an $N$th-order tensor $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$. TT decomposition represents $\boldsymbol{\mathcal{A}}$ by two matrices and $N-2$ core tensors, that is,

$$\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})=\sum_{r_{1},\cdots,r_{N-1}=1}^{R_{1},\cdots,R_{N-1}}\boldsymbol{G}_{1}(i_{1},r_{1})\boldsymbol{\mathcal{G}}_{2}(r_{1},i_{2},r_{2})\cdots\boldsymbol{\mathcal{G}}_{N-1}(r_{N-2},i_{N-1},r_{N-1})\boldsymbol{G}_{N}(r_{N-1},i_{N}) \quad (1)$$

where $\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})$ denotes the $(i_{1},i_{2},\cdots,i_{N})$ entry, $\boldsymbol{G}_{1}\in\mathbb{R}^{I_{1}\times R_{1}}$, $\boldsymbol{G}_{N}\in\mathbb{R}^{R_{N-1}\times I_{N}}$ and $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k-1}\times I_{k}\times R_{k}}$ with $k\in[2,N-1]$ are termed core tensors, and $[R_{1},R_{2},\cdots,R_{N-1}]$ are the TT ranks. By contrast, TR decomposes $\boldsymbol{\mathcal{A}}$ into $N$ 3rd-order tensors:

$$\boldsymbol{\mathcal{A}}(i_{1},i_{2},\cdots,i_{N})=\sum_{r_{1},\cdots,r_{N}=1}^{R_{1},\cdots,R_{N}}\boldsymbol{\mathcal{G}}_{1}(r_{1},i_{1},r_{2})\boldsymbol{\mathcal{G}}_{2}(r_{2},i_{2},r_{3})\cdots\boldsymbol{\mathcal{G}}_{N-1}(r_{N-1},i_{N-1},r_{N})\boldsymbol{\mathcal{G}}_{N}(r_{N},i_{N},r_{1}) \quad (2)$$

where $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times I_{k}\times R_{k+1}}$ with $k\in[1,N]$ (and $R_{N+1}=R_{1}$), and $[R_{1},R_{2},\cdots,R_{N}]$ are the TR ranks. The main difference between the TT and TR factorizations lies in the first and last core arrays. Specifically, the TT format requires matrices in the first and last positions, while all core arrays in the TR factorization are 3rd-order tensors. Thereby, the TT decomposition might result in large intermediate cores and small boundary factors, which restricts its representational ability and flexibility. Besides, multiplying TT cores must obey a strict order, and hence the convolution kernel in convolutional layers is unfolded into a vector to perform the sequenced operation [16]. This motivates us to devise a nonlinear TR format for neural network compression.
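To make the TR format concrete, the following PyTorch sketch reconstructs a small 3rd-order tensor from its TR cores according to (2); the sizes and ranks are illustrative choices rather than values used in the paper.

```python
import torch

# Illustrative TR cores for a 3rd-order tensor of size 4 x 5 x 6 with
# TR ranks [R1, R2, R3] = [2, 3, 2]; core G_k has shape (R_k, I_k, R_{k+1}),
# and the last rank index wraps around to R_1, closing the ring.
ranks, dims = [2, 3, 2], [4, 5, 6]
G1 = torch.randn(ranks[0], dims[0], ranks[1])
G2 = torch.randn(ranks[1], dims[1], ranks[2])
G3 = torch.randn(ranks[2], dims[2], ranks[0])

# Eq. (2): A(i, j, k) = sum_{r1, r2, r3} G1(r1, i, r2) G2(r2, j, r3) G3(r3, k, r1)
A = torch.einsum('aib,bjc,cka->ijk', G1, G2, G3)
print(A.shape)   # torch.Size([4, 5, 6])
```

For a general $N$, the cores are contracted one after another and the two remaining boundary rank indices are traced out at the end; in the TT format the boundary factors are matrices, so no such closing trace appears.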

III Nonlinear Tensor Ring Network

In this section, the proposed NTRN for compressing fully-connected and convolutional layers is introduced in detail.

III-A Fully-Connected Layer Decomposition

A fully-connected layer maps an input vector $\boldsymbol{x}\in\mathbb{R}^{I}$ to an output vector $\boldsymbol{y}\in\mathbb{R}^{O}$ via a weight matrix $\boldsymbol{W}\in\mathbb{R}^{I\times O}$. Mathematically, it is formulated as

$$\boldsymbol{y}=\boldsymbol{W}^{T}\boldsymbol{x} \quad (3)$$

where $(\cdot)^{T}$ denotes the transpose operator. Neural network compression by low-rank matrix/tensor factorization decomposes $\boldsymbol{W}$ into small-size factors. To compress $\boldsymbol{W}$ in tensor structure, we first reshape $\boldsymbol{W}$ into a high-order tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}\times O_{1}\times\cdots\times O_{M}}$ with

$$\prod_{n=1}^{N}I_{n}=I,\quad\prod_{m=1}^{M}O_{m}=O. \quad (4)$$

Then, based on the TR format, $\boldsymbol{\mathcal{W}}$ is factorized into $(N+M)$ 3rd-order tensors:

$$\boldsymbol{\mathcal{W}}=\sum_{r_{1},\cdots,r_{M+N}=1}^{R_{1},\cdots,R_{M+N}}\boldsymbol{\mathcal{G}}_{1}(r_{1},:,r_{2})\circ\boldsymbol{\mathcal{G}}_{2}(r_{2},:,r_{3})\circ\cdots\circ\boldsymbol{\mathcal{G}}_{M+N}(r_{M+N},:,r_{1}) \quad (5)$$

where $\circ$ denotes the outer product, $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times I_{k}\times R_{k+1}}$ for $k\in[1,N]$, $\boldsymbol{\mathcal{G}}_{k}\in\mathbb{R}^{R_{k}\times O_{k-N}\times R_{k+1}}$ for $k\in[N+1,N+M]$, and $\boldsymbol{\mathcal{G}}_{k}(r_{k},:,r_{k+1})$ is the corresponding vertical fiber of $\boldsymbol{\mathcal{G}}_{k}$. Moreover, $\boldsymbol{x}$ and $\boldsymbol{y}$ are tensorized as $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$ and $\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{O_{1}\times\cdots\times O_{M}}$, respectively. Thereby, (3) can be rewritten in the tensor format:

$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{X}}\times_{1}^{2}\boldsymbol{\mathcal{G}}_{1}\times_{1,N+1}^{2,1}\boldsymbol{\mathcal{G}}_{2}\cdots\times_{1,3}^{2,1}\boldsymbol{\mathcal{G}}_{N}\times_{2}^{1}\boldsymbol{\mathcal{G}}_{N+1}\cdots\times_{M-1}^{1}\boldsymbol{\mathcal{G}}_{N+M-1}\times_{1,M+1}^{3,1}\boldsymbol{\mathcal{G}}_{N+M} \quad (6)$$

where $\times_{k}^{1}$ and $\times_{k,l}^{1,3}$ are tensor contraction operations [27]. Consider $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k}\times I_{k+1}\times\cdots\times I_{l-1}\times I_{l}\times I_{l+1}\times\cdots\times I_{N}}$ and $\boldsymbol{\mathcal{B}}\in\mathbb{R}^{J_{1}\times J_{2}\times J_{3}}$ with $I_{k}=J_{1}$ and $I_{l}=J_{3}$. Then $\boldsymbol{\mathcal{A}}\times_{k}^{1}\boldsymbol{\mathcal{B}}$ yields an $(N+1)$th-order tensor $\boldsymbol{\mathcal{C}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k+1}\times\cdots\times I_{N}\times J_{2}\times J_{3}}$ whose entries are calculated by

$$\boldsymbol{\mathcal{C}}(i_{1},\cdots,i_{k-1},i_{k+1},\cdots,i_{N},j_{2},j_{3})=\sum_{i_{k}=1}^{I_{k}}\boldsymbol{\mathcal{A}}(i_{1},\cdots,i_{k},\cdots,i_{N})\boldsymbol{\mathcal{B}}(i_{k},j_{2},j_{3}). \quad (7)$$

Similarly, $\boldsymbol{\mathcal{A}}\times_{k,l}^{1,3}\boldsymbol{\mathcal{B}}$ generates an $(N-1)$th-order tensor $\boldsymbol{\mathcal{D}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{k-1}\times I_{k+1}\times\cdots\times I_{l-1}\times I_{l+1}\times\cdots\times I_{N}\times J_{2}}$ with entries

$$\boldsymbol{\mathcal{D}}(i_{1},\cdots,i_{k-1},i_{k+1},\cdots,i_{l-1},i_{l+1},\cdots,i_{N},j_{2})=\sum_{i_{k},i_{l}=1}^{I_{k},I_{l}}\boldsymbol{\mathcal{A}}(i_{1},\cdots,i_{k},\cdots,i_{l},\cdots,i_{N})\boldsymbol{\mathcal{B}}(i_{k},j_{2},i_{l}). \quad (8)$$
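To make the two contraction operators concrete, the short PyTorch snippet below realizes (7) and (8) with torch.einsum for a 4th-order $\boldsymbol{\mathcal{A}}$ and a 3rd-order $\boldsymbol{\mathcal{B}}$; the tensor sizes are arbitrary and serve only to check the resulting shapes.

```python
import torch

# Eq. (7): A x_3^1 B contracts mode 3 of A with mode 1 of B.
A = torch.randn(3, 4, 5, 6)            # I1 x I2 x I3 x I4
B = torch.randn(5, 7, 2)               # J1 x J2 x J3 with J1 = I3
C = torch.einsum('abkd,kmn->abdmn', A, B)
print(C.shape)                         # torch.Size([3, 4, 6, 7, 2])

# Eq. (8): A x_{2,4}^{1,3} B contracts mode 2 of A with mode 1 of B
# and mode 4 of A with mode 3 of B (so J1 = I2 and J3 = I4 must hold).
A = torch.randn(3, 5, 6, 2)            # I1 x I2 x I3 x I4
B = torch.randn(5, 7, 2)               # J1 x J2 x J3
D = torch.einsum('akbl,kml->abm', A, B)
print(D.shape)                         # torch.Size([3, 6, 7])
```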

It is worth noting that (6) is the same as the fully-connected TR layer in [17]. To boost the accuracy, we propose to include a nonlinear activation function after each tensor contraction in (6), resulting in

$$\boldsymbol{\mathcal{Y}}=f(\cdots f(f(\cdots f(\boldsymbol{\mathcal{X}}\times_{1}^{2}\boldsymbol{\mathcal{G}}_{1})\cdots\times_{1,3}^{2,1}\boldsymbol{\mathcal{G}}_{N})\times_{2}^{1}\boldsymbol{\mathcal{G}}_{N+1})\cdots\times_{M-1}^{1}\boldsymbol{\mathcal{G}}_{N+M-1})\times_{1,M+1}^{3,1}\boldsymbol{\mathcal{G}}_{N+M} \quad (9)$$

where $f(\cdot)$ denotes the nonlinear activation function, e.g., Tanh in Fig. 1. Note that $f(\cdot)$ is not applied after $\boldsymbol{\mathcal{G}}_{N+M}$, since the layer's own activation function outside the compressed layer takes its place. Fig. 1 illustrates the fully-connected layer in NTRN.
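A minimal functional sketch of the nonlinear TR fully-connected layer (9) is given below for the case $N=M=2$, so that the contraction pattern of (6) can be written out explicitly with torch.einsum. The function name, core shapes and rank values are illustrative, and $f$ is taken as Tanh as in Fig. 1.

```python
import torch

def ntrn_fc_forward(x, cores, f=torch.tanh):
    """Nonlinear TR fully-connected layer, Eq. (9), for N = M = 2.

    x     : input of length I = I1 * I2, tensorized to (I1, I2)
    cores : [G1, G2, G3, G4] with shapes
            (R1, I1, R2), (R2, I2, R3), (R3, O1, R4), (R4, O2, R1)
    """
    G1, G2, G3, G4 = cores
    X = x.reshape(G1.shape[1], G2.shape[1])         # (I1, I2)
    T = f(torch.einsum('ij,aip->jap', X, G1))       # X x_1^2 G1        -> (I2, R1, R2)
    T = f(torch.einsum('jap,pjq->aq', T, G2))       # x_{1,3}^{2,1} G2  -> (R1, R3)
    T = f(torch.einsum('aq,qor->aor', T, G3))       # x_2^1 G3          -> (R1, O1, R4)
    Y = torch.einsum('aor,rpa->op', T, G4)          # x_{1,3}^{3,1} G4  -> (O1, O2), no f here
    return Y.reshape(-1)                            # the layer's own activation follows outside

# Example with illustrative sizes: I = 6 * 8, O = 4 * 5, all TR ranks equal to 3
cores = [torch.randn(3, 6, 3), torch.randn(3, 8, 3),
         torch.randn(3, 4, 3), torch.randn(3, 5, 3)]
print(ntrn_fc_forward(torch.randn(48), cores).shape)   # torch.Size([20])
```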

Figure 1: Illustrations of fully-connected layer in conventional neural network and nonlinear tensor ring network.

III-B Convolutional Layer Decomposition

At a convolutional layer, an input tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{H\times W\times I}$ is convolved with a 4th-order kernel tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times D\times I\times O}$ to produce an output tensor $\boldsymbol{\mathcal{Y}}\in\mathbb{R}^{\widetilde{H}\times\widetilde{W}\times O}$, formulated as

$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{W}}*\boldsymbol{\mathcal{X}} \quad (10)$$

where $*$ denotes the convolution operation in CNNs. To compress $\boldsymbol{\mathcal{W}}$, TR decomposition factorizes it into three core tensors, viz. $\boldsymbol{\mathcal{V}}\in\mathbb{R}^{R_{1}\times D\times D\times R_{2}}$, $\boldsymbol{\mathcal{U}}\in\mathbb{R}^{R_{2}\times I\times R_{3}}$ and $\widetilde{\boldsymbol{\mathcal{U}}}\in\mathbb{R}^{R_{3}\times O\times R_{1}}$. Therefore, $\boldsymbol{\mathcal{W}}$ is computed by

$$\boldsymbol{\mathcal{W}}=\sum_{r_{1},r_{2},r_{3}=1}^{R_{1},R_{2},R_{3}}\boldsymbol{\mathcal{V}}(r_{1},:,:,r_{2})\circ\boldsymbol{\mathcal{U}}(r_{2},:,r_{3})\circ\widetilde{\boldsymbol{\mathcal{U}}}(r_{3},:,r_{1}). \quad (11)$$

Note that $\boldsymbol{\mathcal{V}}$ retains the convolution kernel, which is different from the NTT format. In NTRN, the convolution operation (10) is decomposed into the following procedure:

$$\boldsymbol{\mathcal{Z}}_{1}=f(\boldsymbol{\mathcal{X}}\times_{3}^{2}\boldsymbol{\mathcal{U}}) \quad (12)$$
$$\boldsymbol{\mathcal{Z}}_{2}=f(\boldsymbol{\mathcal{V}}^{p_{1}}*\boldsymbol{\mathcal{Z}}_{1}) \quad (13)$$
$$\boldsymbol{\mathcal{Y}}=\boldsymbol{\mathcal{Z}}_{2}\times_{1,3}^{3,1}\widetilde{\boldsymbol{\mathcal{U}}} \quad (14)$$

where $\boldsymbol{\mathcal{Z}}_{1}\in\mathbb{R}^{H\times W\times R_{2}\times R_{3}}$, $\boldsymbol{\mathcal{Z}}_{2}\in\mathbb{R}^{\widetilde{H}\times\widetilde{W}\times R_{1}\times R_{3}}$, and $(\cdot)^{p_{1}}$ denotes the first tensor permutation, which relates $\boldsymbol{\mathcal{V}}^{p_{1}}$ to $\boldsymbol{\mathcal{V}}$ by

$$\boldsymbol{\mathcal{V}}^{p_{1}}(d_{1},d_{2},\alpha_{2},\alpha_{1})=\boldsymbol{\mathcal{V}}(\alpha_{1},d_{1},d_{2},\alpha_{2}). \quad (15)$$

For the convolutional layer in NTRN, the input tensor $\boldsymbol{\mathcal{X}}$ is first contracted with the intermediate factor $\boldsymbol{\mathcal{U}}$, then convolved with the first core $\boldsymbol{\mathcal{V}}$, and finally contracted with the last core $\widetilde{\boldsymbol{\mathcal{U}}}$. It is worth mentioning that this procedure is entirely different from the operation in the NTT format. On the other hand, when $I$ and $O$ are large, $\boldsymbol{\mathcal{U}}$ and $\widetilde{\boldsymbol{\mathcal{U}}}$ can be further factorized into small-size core tensors to achieve a higher CR. For instance, $\boldsymbol{\mathcal{U}}$ can be reshaped as $\widehat{\boldsymbol{\mathcal{U}}}\in\mathbb{R}^{R_{2}\times I_{1}\times I_{2}\times I_{3}\times R_{3}}$ with $I_{1}I_{2}I_{3}=I$, and then factorized into three tensors, namely, $\boldsymbol{\mathcal{G}}_{1}\in\mathbb{R}^{R_{2}\times I_{1}\times R_{22}}$, $\boldsymbol{\mathcal{G}}_{2}\in\mathbb{R}^{R_{22}\times I_{2}\times R_{23}}$ and $\boldsymbol{\mathcal{G}}_{3}\in\mathbb{R}^{R_{23}\times I_{3}\times R_{3}}$. To compute $\widehat{\boldsymbol{\mathcal{U}}}$ from its core tensors, the proposed NTRN adopts the nonlinear contraction operation

$$\widehat{\boldsymbol{\mathcal{U}}}=f(f(\boldsymbol{\mathcal{G}}_{1}\times_{3}^{1}\boldsymbol{\mathcal{G}}_{2})\times_{4}^{1}\boldsymbol{\mathcal{G}}_{3}). \quad (16)$$

The convolutional layer in NTRN is illustrated in Fig. 2.
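The procedure (12)-(15) can be sketched in PyTorch as follows. This is one possible realization under the assumption $f=$ Tanh: the rank mode $R_{3}$ is mapped to the batch dimension of conv2d, $R_{2}$ and $R_{1}$ to its input and output channels, and the permutation of $\boldsymbol{\mathcal{V}}$ plays the role of $\boldsymbol{\mathcal{V}}^{p_{1}}$ arranged in conv2d's weight layout. All names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def ntrn_conv_forward(X, V, U, U_tilde, f=torch.tanh):
    """Nonlinear TR convolutional layer, Eqs. (12)-(14).

    X       : input feature map,   shape (H, W, I)
    V       : spatial core,        shape (R1, D, D, R2)
    U       : input-channel core,  shape (R2, I, R3)
    U_tilde : output-channel core, shape (R3, O, R1)
    """
    # Eq. (12): contract the channel mode of X with U, then apply f.
    Z1 = f(torch.einsum('hwi,aib->hwab', X, U))          # (H, W, R2, R3)

    # Eq. (13): 2-D convolution with the permuted spatial core V^{p1};
    # R3 acts as a batch dimension, R2 -> R1 as input/output channels.
    z = Z1.permute(3, 2, 0, 1)                           # (R3, R2, H, W)
    w = V.permute(0, 3, 1, 2)                            # (R1, R2, D, D)
    Z2 = f(F.conv2d(z, w))                               # (R3, R1, H~, W~)
    Z2 = Z2.permute(2, 3, 1, 0)                          # (H~, W~, R1, R3)

    # Eq. (14): contract both rank modes with U_tilde; no f here.
    return torch.einsum('hwab,boa->hwo', Z2, U_tilde)    # (H~, W~, O)

# Illustrative sizes: 3x3 kernel, I = 16 input and O = 32 output channels, all ranks 4
X = torch.randn(28, 28, 16)
V, U, U_t = torch.randn(4, 3, 3, 4), torch.randn(4, 16, 4), torch.randn(4, 32, 4)
print(ntrn_conv_forward(X, V, U, U_t).shape)             # torch.Size([26, 26, 32])
```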

Figure 2: Illustrations of convolutional layer in conventional neural network and nonlinear tensor ring network.

III-C Compression Ratio

CR is defined as:

$$\text{CR}=\frac{\#\ \text{parameters of original network}}{\#\ \text{parameters of compressed network}}. \quad (17)$$

For a fully-connected layer with weight matrix $\boldsymbol{W}\in\mathbb{R}^{I\times O}$, the number of parameters is $IO$. The proposed NTRN requires $(\sum_{n=1}^{N}I_{n}+\sum_{m=1}^{M}O_{m})R^{2}$ parameters under the assumption $R_{1}=\cdots=R_{M+N}=R$. On the other hand, the convolutional layer of a conventional neural network needs to store $D^{2}IO$ parameters for $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{D\times D\times I\times O}$. In NTRN, the dimensions of $\boldsymbol{\mathcal{W}}$ are reshaped into $D\times D\times I_{1}\times\cdots\times I_{N}\times O_{1}\times\cdots\times O_{M}$ for a high CR. Hence, only $(\sum_{n=1}^{N}I_{n}+\sum_{m=1}^{M}O_{m}+D^{2})R^{2}$ parameters are required. Clearly, decreasing $R$ increases the CR.
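As a worked example of (17), the snippet below counts the parameters of one TR-compressed fully-connected layer, using the reshape and rank applied to $\boldsymbol{W}_{1}\in\mathbb{R}^{784\times 1024}$ in Section IV-A; bias terms are not counted.

```python
# W1 in R^{784 x 1024} is reshaped to modes 4 x 7 x 4 x 7 (input) and
# 4 x 8 x 4 x 8 (output), and all TR ranks are set to R = 16.
in_modes, out_modes, R = [4, 7, 4, 7], [4, 8, 4, 8], 16

original = 784 * 1024                                   # I * O
compressed = (sum(in_modes) + sum(out_modes)) * R**2    # (sum I_n + sum O_m) * R^2
print(original, compressed, original / compressed)      # 802816 11776 ~68.2
```

Note that this is a layer-level ratio; the network-level CRs reported in Section IV aggregate all compressed layers.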

IV Experimental Results

In this section, the proposed NTRN is evaluated with a basic MLP, a basic CNN, LeNet-5 [23] and VGG-11 [24] on the MNIST [23], Fashion MNIST [25] and Cifar-10 [26] datasets. All networks are implemented in the PyTorch framework [28]. The experiments for the MLP, CNN and LeNet-5 are run on an Nvidia GTX 2060 GPU, while VGG-11 is run on an Nvidia GTX 2080Ti GPU. In our experiments, the TR ranks within an individual layer are the same, but can differ across layers. Unless otherwise specified, the activation function outside the compressed layers is ReLU and the pooling function in CNNs is max pooling (size = 2, stride = 2).

IV-A Fully-Connected Layer Evaluation

TABLE I: MLP compression on MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 98.15 0.001 98.3 98.0 5.1 MB
TRN [17] 57x 97.34 0.003 97.7 97.0 0.1 MB
NTRN 57x 97.72 0.002 97.9 97.4 0.1 MB
TRN [17] 359x 95.94 0.004 96.3 95.2 0.02 MB
NTRN 359x 96.51 0.002 96.7 96.3 0.02 MB

We first test the proposed NTRN using an MLP on the MNIST dataset. The MLP consists of an input layer, two hidden layers and an output layer with 784, 1024, 512 and 10 nodes, respectively. All images in MNIST are reshaped into vectors of length 784. Therefore, the weight matrices of the MLP are $\boldsymbol{W}_{1}\in\mathbb{R}^{784\times 1024}$, $\boldsymbol{W}_{2}\in\mathbb{R}^{1024\times 512}$ and $\boldsymbol{W}_{3}\in\mathbb{R}^{512\times 10}$. To compress them, $\boldsymbol{W}_{1}$, $\boldsymbol{W}_{2}$ and $\boldsymbol{W}_{3}$ are tensorized to dimensions $4\times 7\times 4\times 7\times 4\times 8\times 4\times 8$, $4\times 8\times 4\times 8\times 8\times 8\times 8$ and $8\times 8\times 8\times 10$, respectively. The number of training epochs is set to 50. The experimental results over 20 independent trials are tabulated in Table I, where Acc and Std denote the average accuracy on the test set and the standard deviation over all trials, respectively. In addition, the highest and lowest accuracy among all trials are listed in columns H and L, respectively. Moreover, Storage indicates the disk space occupied by the parameters. At 57x CR, the ranks of the three layers are $\{16,14,8\}$, while the ranks are set to $\{6,5,5\}$ at 359x CR. We can see that the proposed NTRN outperforms TRN [17] at both low and high CRs in terms of average accuracy, standard deviation, and highest and lowest accuracy.
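To connect the 57x figure with the parameter-count formula of Section III-C, the sketch below tallies the original and compressed parameters of the three tensorized layers under the stated ranks $\{16,14,8\}$ (bias terms ignored); the resulting ratio of about 57 is consistent with Table I.

```python
# Verify the 57x CR of the compressed MLP (Table I) from the layer
# shapes and ranks given in the text; bias terms are not counted.
layers = [
    # (input modes,  output modes, rank R)
    ([4, 7, 4, 7],   [4, 8, 4, 8], 16),   # W1: 784 x 1024
    ([4, 8, 4, 8],   [8, 8, 8],    14),   # W2: 1024 x 512
    ([8, 8, 8],      [10],          8),   # W3: 512 x 10
]

orig = comp = 0
for in_modes, out_modes, R in layers:
    size_in = size_out = 1
    for d in in_modes:
        size_in *= d
    for d in out_modes:
        size_out *= d
    orig += size_in * size_out                       # I * O
    comp += (sum(in_modes) + sum(out_modes)) * R**2  # (sum I_n + sum O_m) * R^2

print(orig, comp, round(orig / comp, 1))   # 1332224 23360 57.0
```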

IV-B Convolutional Layer Evaluation

We now investigate the compression performance with a CNN consisting of two convolutional layers and one fully-connected layer on MNIST. The first and second convolutional layers contain 16 and 32 kernels of size 3, respectively. Moreover, the stride is set to 1 and the padding to 0. Since the dimensions of the input data are $28\times 28\times 1$, the weight tensor of the first convolutional layer is $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 1\times 16}$. Besides, the weight arrays of the second convolutional layer and the fully-connected layer are $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{3\times 3\times 16\times 32}$ and $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{320\times 10}$, respectively. To compress the weight arrays of the convolutional layers, the dimensions of $\boldsymbol{\mathcal{W}}_{1}$ and $\boldsymbol{\mathcal{W}}_{2}$ are reshaped as $3\times 3\times 1\times 4\times 4$ and $3\times 3\times 4\times 4\times 4\times 8$, respectively. In this ablation study, the fully-connected layer of the CNN is not compressed, and hence its weight matrix is not resized.

The results are shown in Table II. The ranks at 4.3x and 8.9x CRs are $\{2,6\}$ and $\{2,4\}$, respectively. It is seen that the proposed NTRN attains higher accuracy than TRN at the same CR, while the standard deviation of NTRN is smaller than that of TRN. Note that the storage information is not provided since the fully-connected layer is not compressed.

TABLE II: CNN compression on MNIST dataset.
Method CR Acc(%) Std H(%) L(%)
Original 1x 98.71 0.001 98.9 98.5
TRN [17] 4.3x 98.35 0.003 98.6 98.1
NTRN 4.3x 98.60 0.001 98.9 98.4
TRN [17] 8.9x 97.86 0.003 98.3 97.2
NTRN 8.9x 98.15 0.002 98.4 97.8

IV-C LeNet-5 Evaluation

TABLE III: LeNet-5 compression on Fashion MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 89.86 0.005 90.7 89.0 2.6 MB
TRN [17] 13x 88.10 0.005 89.0 87.1 0.2 MB
NTRN 13x 88.53 0.004 89.4 87.8 0.2 MB
TRN [17] 72x 87.91 0.006 88.7 87.0 0.04 MB
NTRN 72x 88.47 0.003 88.9 87.7 0.04 MB

With promising results on compressing each layer type individually, we conduct experiments using LeNet-5 [23] on the Fashion MNIST dataset to further evaluate the proposed NTRN, where LeNet-5 consists of two convolutional and two fully-connected layers. The resized weight tensors from input to output are $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{5\times 5\times 1\times 4\times 5}$, $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{5\times 5\times 4\times 5\times 5\times 10}$, $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{5\times 10\times 5\times 5\times 8\times 8\times 8}$ and $\boldsymbol{\mathcal{W}}_{4}\in\mathbb{R}^{8\times 8\times 8\times 10}$. As shown in Table III, the proposed NTRN is superior to TRN at both 13x and 72x CRs. Herein, the ranks are $\{3,10,30,8\}$ for 13x CR and $\{3,8,10,5\}$ for 72x CR.

IV-D VGG-11 Evaluation

Furthermore, the developed NTRN is examined with VGG-11 on two datasets, namely Cifar-10 and Fashion MNIST. VGG-11 consists of eight convolutional and three fully-connected layers. For the Cifar-10 dataset, the reshaped weight tensors from input to output are $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 3\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{2}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 2\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{3}\in\mathbb{R}^{3\times 3\times 2\times 4\times 4\times 4\times 4\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{4}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 4}$, $\boldsymbol{\mathcal{W}}_{5}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{6}=\boldsymbol{\mathcal{W}}_{7}=\boldsymbol{\mathcal{W}}_{8}\in\mathbb{R}^{3\times 3\times 4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{9}\in\mathbb{R}^{4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4\times 2}$, $\boldsymbol{\mathcal{W}}_{10}\in\mathbb{R}^{4\times 4\times 4\times 4\times 2\times 4\times 4\times 4\times 4}$ and $\boldsymbol{\mathcal{W}}_{11}\in\mathbb{R}^{4\times 4\times 4\times 4\times 10}$, respectively. For the Fashion MNIST dataset, only the first weight tensor changes, to $\boldsymbol{\mathcal{W}}_{1}\in\mathbb{R}^{3\times 3\times 1\times 4\times 4\times 4}$; the dimensions of the other layers are the same as those for Cifar-10. The ranks are set to $\{8,25,40,50,70,70,70,70,15,15,5\}$ for 9x CR, while they are $\{8,25,30,40,50,50,50,50,10,10,5\}$ at 17x CR.

TABLE IV: VGG-11 compression on Cifar-10 dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 80.40 0.005 81.0 79.4 36.6 MB
TRN [17] 9x 76.44 0.006 77.2 75.4 4.2 MB
NTRN 9x 77.87 0.002 78.2 77.6 4.2 MB
TRN [17] 17x 75.67 0.005 76.6 75.1 2.2 MB
NTRN 17x 77.50 0.007 78.5 76.3 2.2 MB

Table IV lists the results on the Cifar-10 dataset. NTRN demonstrates a clear superiority over TRN: specifically, the average accuracy of the proposed NTRN is higher by 1.4% and 1.8% at 9x and 17x CRs, respectively. The corresponding results for Fashion MNIST are tabulated in Table V. Compared with TRN, the proposed NTRN effectively mitigates the loss of accuracy caused by compression.

TABLE V: VGG-11 compression on Fashion MNIST dataset.
Method CR Acc(%) Std H(%) L(%) Storage
Original 1x 92.93 0.002 93.2 92.7 36.7 MB
TRN [17] 9x 91.81 0.002 92.0 91.6 4.2 MB
NTRN 9x 92.02 0.002 92.3 91.7 4.2 MB
TRN [17] 17x 91.41 0.003 91.7 90.8 2.3 MB
NTRN 17x 91.79 0.003 92.2 91.2 2.3 MB

V Conclusion

In this paper, we have proposed a novel network compression technique, termed NTRN, in which the weight arrays of fully-connected and convolutional layers are compressed in TR format. Different from the conventional TRN, a nonlinear activation function is added after the tensor contraction and convolution operations inside each compressed layer. The proposed NTRN achieves higher accuracy than the state-of-the-art TRN compression method. Its superior performance has been verified on the image classification task with different DNN architectures, namely MLP, CNN, LeNet-5 and VGG-11, on the MNIST, Fashion MNIST and Cifar-10 datasets. We believe that the proposed NTRN is a promising candidate for embedded systems because of its effectiveness in achieving ultra-low memory cost.

References

  • [1] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, Jan. 2013.
  • [2] Q. Liu and J. Wu, “Parameter tuning-free missing-feature reconstruction for robust sound recognition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 1, pp. 78–89, Jan. 2020.
  • [3] Y. Wang, X. Song, and K. Chen, “Channel and space attention neural network for image denoising,” IEEE Signal Process. Lett., vol. 28, pp. 424–428, Feb. 2021.
  • [4] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, “Exploiting feature and class relationships in video categorization with regularized deep neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2017.
  • [5] Y. Guo, Z. Zhang, Y. Huang, and P. Zhang, “DOA estimation method based on cascaded neural network for two closely spaced sources,” IEEE Signal Process. Lett., vol. 27, pp. 570–574, Apr. 2020.
  • [6] Y. Li, C. Huang, L. Ding, Z. Li, Y. Pan, and X. Gao, “Deep learning in bioinformatics: Introduction, application, and perspective in the big data era,” Methods, vol. 166, pp. 4–21, Aug. 2019.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, Nevada, USA., Jun. 2016, pp. 770–778.
  • [8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
  • [9] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
  • [10] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Vancouver, BC, Canada, Oct. 2013, pp. 6655–6659.
  • [11] C. Tai, T. Xiao, Y. Zhang, and X. Wang, “Convolutional neural networks with low-rank regularization,” in Proc. Int. Conf. Learn. Represent., San Juan, Puerto Rico, May 2016.
  • [12] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, Hawaii, USA, Feb. 2017, pp. 7370–7379.
  • [13] J. Huang, W. Sun, and L. Huang, “Deep neural networks compression learning based on multiobjective evolutionary algorithms,” Neurocomputing, vol. 378, pp. 260–269, Feb. 2020.
  • [14] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.
  • [15] M. Astrid and S.-I. Lee, “CP-decomposition with tensor power method for convolutional neural networks compression,” in Proc. IEEE Int. Conf. Big Data Smart Comput., Jeju, South Korea, Feb. 2017, pp. 115–118.
  • [16] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: Compressing convolutional and FC layers alike,” arXiv preprint arXiv:1611.03214, 2016.
  • [17] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal, “Wide compression: Tensor ring nets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, USA, Jun. 2018, pp. 9329–9338.
  • [18] Q. Zhao, M. Sugiyama, L. Yuan, and A. Cichocki, “Learning efficient tensor representations with ring-structured networks,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Brighton, UK, May 2019, pp. 8608–8612.
  • [19] W. Sun, S. Chen, L. Huang, H. C. So, and M. Xie, “Deep convolutional neural network compression via coupled tensor decomposition,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 603–616, Nov. 2020.
  • [20] J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 1828–1837.
  • [21] D. Wang, G. Zhao, H. Chen, Z. Liu, L. Deng, and G. Li, “Nonlinear tensor train format for deep neural network compression,” Neural Netw., vol. 144, pp. 320–333, Dec. 2021.
  • [22] Y. Pan, J. Xu, M. Wang, J. Ye, F. Wang, K. Bai, and Z. Xu, “Compressing recurrent neural networks with tensor ring for action recognition,” in Proc. AAAI Conf. Artif. Intell., Honolulu, Hawaii, USA, Jul. 2019, pp. 4683–4690.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015.
  • [25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [26] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [27] A. Cichocki, N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic, “Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1,” arXiv preprint arXiv:1609.00893, Sep. 2016.
  • [28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, “Pytorch: An imperative style, high-performance deep learning library,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 8026–8037, 2019.