
Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition

Reviewer 88Da

Comment: 1. The authors did not provide thorough comparisons with existing state-of-the-art model compression methods, such as [1], [2], and works in the related work. [1] Anwar, S., Hwang, K., & Sung, W. (2017). Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3), 1-18. [2] He, Y., & Xiao, L. (2023). Structured pruning for deep convolutional neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Response: In the rebuttal, we include comparison results with a leading integer-only quantization method and a leading structured pruning approach. Furthermore, we include results comparing our approach (with fine-tuning) against a low-rank-promoting compression method in which rank selection is needed during training (also with fine-tuning). Here, fine-tuning refers to fine-tuning the parameters of the compressed trained model.

Comment: 2. It is suggested that the authors consider utilizing the ImageNet dataset for validation. The credibility of the experimental results would be significantly enhanced if they were tested against such a large-scale dataset.

Response: Due to time constraints, we have chosen not to run the ImageNet experiment. We started the Tiny-ImageNet experiment and will provide the results if it completes in time. If the paper is accepted, we will provide these results. We would like to highlight that we validate our approach using the CIFAR-100 dataset in Figure 3.

Reviewer HnzA

Comment: 1. The experiments are conducted on relatively small datasets; the current version of the manuscript lacks extensive empirical evaluation.

Response: See our response to Reviewer 88Da - Comment 2. For other empirical evaluations, please see the experiments in the response supplementary link.

Comment: 2. Ablation on the effect of the number of linear compositions is missing; it would be interesting to see the behaviours for N>3.

Response: In the rebuttal, we provide results for N=4 for the setting of the FCN-8 model in Table 1. See Experiment 3 in the supplementary link. We note that the test accuracy is 2% less than the one reported for N=1. We remark that while, in theory, increasing N further promotes low-rankness of the composed trained weight, in practice we face several challenges. These include (i) the increased training run-time (although its impact is not very significant, as training occurs only once and the goal of compression is to improve testing-time performance), (ii) finding the right weight decay parameter that maintains the same (or very similar) test accuracy, and (iii) the increased non-convexity of the training optimization problem. This trade-off will be discussed in the revised paper.

Comment: 3. The method is not benchmarked against baseline methods, making it difficult to evaluate the effectiveness of the method w.r.t. the existing ones.

Response: See our response to Reviewer 88Da comment 1.

Comment: 4. In Fig. 2, results on CIFAR-10 seem to be low. It seems that below 60% of retained singular values, performance starts to degrade. Prior work has shown that a model trained on CIFAR can be easily pruned up to 90%.

Response: For the GSVT results (Figure 2 (right) - second row), our method maintains a low accuracy drop with only 30% of retained singular values. However, we agree with your observation regarding the LSVT method.

In response to comparisons with structured pruning methods, please see our response to Reviewer 88Da - Comment 1.

Comment: 5. What’s the dataset used for Fig. 4?

Response: CIFAR10.

Question: 1. ’During training, we employ overparametrization along with weight decay of suitable strength to achieve maximal rank compression in a global manner without sacrificing the model’s capacity’. What do you mean by over-parametrization, and how do you use/employ it?

Response: By overparameterization, we mean that, for every weight matrix, we append linear layers as described in Equation (5). During training, we optimize over these linear layers before the activation. At testing time, the SVD truncation is applied to the product of the trained linear weights.
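To make this concrete, here is a minimal, illustrative PyTorch-style sketch (our own, not the code used in the paper; the class name ComposedLinear, the shape convention, and the keep_ratio parameter are assumptions for illustration) of appending linear factors to one weight matrix and truncating the SVD of their product after training:

```python
import torch
import torch.nn as nn

class ComposedLinear(nn.Module):
    """One weight matrix overparameterized as a composition of linear factors."""
    def __init__(self, in_dim, out_dim, n_factors=3):
        super().__init__()
        # One workable shape convention: the first n_factors - 1 factors are
        # square (in_dim x in_dim) and the last factor maps in_dim -> out_dim.
        dims = [in_dim] * n_factors + [out_dim]
        self.factors = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(dims[i + 1], dims[i]))
             for i in range(n_factors)]
        )

    def effective_weight(self):
        W = self.factors[0]
        for F in self.factors[1:]:
            W = F @ W                      # compose the linear factors (no activation)
        return W                           # shape: (out_dim, in_dim)

    def forward(self, x):
        # Functionally equivalent to a single linear layer with the composed weight.
        return x @ self.effective_weight().t()

def svd_truncate(W, keep_ratio=0.5):
    # Post-training compression: keep a fraction of the singular values of the
    # composed weight (the ratio here is an arbitrary illustrative choice).
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    k = max(1, int(keep_ratio * S.numel()))
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
```

Training such a module with the optimizer's standard weight_decay penalizes the squared Frobenius norm of every factor, which is the mechanism Proposition 4.1 uses to justify the low-rankness of the composed weight.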

Question: 2. Why is the model in Table 1 trained without augmentation?

Response: For MNIST, data augmentation is not needed as the test accuracies are already high. We use data augmentation in Table 2 to obtain CIFAR10/100 models (N=1) with improved test accuracies. Using data augmentation to improve test accuracy on the CIFAR10/100 datasets is standard practice (see the caption of Table 6 of the original ResNet paper ‘Deep Residual Learning for Image Recognition’).

Reviewer 1gem

Comment: 1. The proposed method builds on Proposition 4.1, but the language around it is vague in terms of whether there is a theoretical contribution by the authors. Is Proposition 4.1 a corollary of results in Shang et al. (2016) or a similar result with a similar proof? How does the mentioned distinction affect the proof? Any non-trivial differences should be explicitly stated in the paper.

Response: Proposition 4.1 is very similar to Theorems 4 and 5 in Shang et al. (2016). There are no theoretical contributions made by us. The mentioned distinction does not affect the proof. There are no non-trivial differences in the proof, and we therefore decided not to include it in the paper, as it directly follows Shang et al. (2016). Proposition 4.1 is used to theoretically justify why standard weight decay in our method promotes low-rankness for models employing the linear layer composition method used in our paper. We agree with the reviewer that it is necessary to state explicitly that the proof does not change under our rank assumption. As such, in the revised paper, we will emphasize this point more clearly.
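For the reader's intuition, the well-known bilinear (N=2) special case of this family of variational identities (stated here for illustration; the N-factor generalization is what Shang et al. (2016) and Proposition 4.1 formalize) reads

\[
\|\mathbf{W}\|_{*} \;=\; \min_{\mathbf{W}=\mathbf{W}^{1}\mathbf{W}^{2}} \tfrac{1}{2}\Big(\|\mathbf{W}^{1}\|_{F}^{2}+\|\mathbf{W}^{2}\|_{F}^{2}\Big),
\]

so standard weight decay on the two factors acts as nuclear-norm regularization on their product; with more factors, the analogous identities yield Schatten quasi-norms with p<1 (cf. Equation (8)), which is the low-rank-promoting effect described in the paper.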

Comment: 2. Since the method is a training-time method and relies on overparameterization, it significantly increases the training time and limits the usefulness of the proposed method.

Response: We agree that as N increases, the required training time also increases. However, we would like to highlight that shortening test time for large models is of particular importance relative to shortening training time, because training is done either once or infrequently, whereas testing/inference (function evaluation) takes place continuously. Furthermore, user experience is influenced by test time, as deployed compressed models are commonly used on resource-limited devices.

Comment: 3. Since the main contribution is not theoretical, I would expect the experiments to be more comprehensive (see below). The experiments are also performed on small data sets (MNIST, CIFAR) and using models of very limited capacity. The authors mention that "it's crucial to highlight that any post-training compression technique can be applied to a DNN trained with LoRITa", but while this is technically possible, are there any experimental results indicating that combining such methods with LoRITa yields any additional benefit? I would expect that the authors compare their method to at least some of the mentioned previous low-rank promoted training methods, or to argue more clearly why such a comparison is not of interest.

Response: For evaluation on larger datasets such as ImageNet, see our response to Reviewer 88Da - Comment 2.

In response to showing that our method is applicable to any post-training compression technique, we would like to point out that we illustrate this by using both local and global singular value truncation. However, we agree that our sentence is not specific enough. In the revised paper, we will clearly state that any SVT method can be applied to a LoRITa-trained model.

In response to comparisons with previous arts, please see our response to Reviewer 88Da - Comment 1.

Comment: 5. The authors state that "compressibility is characterized by a rapid decay in the singular values of W". Have you verified that there is a significant difference in the decay of the actual singular values of the weight matrices of models learned with LoRITa compared to those learned without?

Response: In this rebuttal, we provide a plot in the response supplementary link (Experiment 2) to show the faster singular value decay of the LoRITa-trained model (N>1) vs. the baseline (N=1) for FCN8 in Table 1.
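For completeness, a comparison of this kind can be produced along the following lines (an illustrative sketch, not the paper's code; W_baseline and W_lorita below are placeholders for the trained FCN8 weight matrix and the product of the composed factors, respectively):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_spectra(weights, labels):
    """Plot the normalized singular-value spectra of the given weight matrices."""
    for W, label in zip(weights, labels):
        s = np.linalg.svd(np.asarray(W), compute_uv=False)
        plt.semilogy(s / s[0], label=label)   # normalize the largest SV to 1
    plt.xlabel("singular value index")
    plt.ylabel("normalized singular value")
    plt.legend()
    plt.show()

# Usage (placeholders for the actual trained matrices):
# plot_spectra([W_baseline, W_lorita], ["N = 1 (baseline)", "N > 1 (LoRITa)"])
```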

Reviewer LSTL

Comment: 1. The empirical evaluation is missing other baselines from the compression literature, even if they are not exactly the same. This could provide readers a better understanding of the behavior of your approach in terms of parameter reduction and accuracy.

Response: Please see our response to Reviewer 88Da - Comment 1.

Comment: 2. In Proposition 4.1, you mention the product of matrices R_i; however, if they have dimensions M×N, are you talking about an element-wise multiplication? Or are you applying the trick mentioned for the linear layers to go from M×N and then do products of N×N matrices?

Response: We are not applying element-wise multiplication. We are appending linear layers based on Equation (5). The dimension of the 1st matrix is M×N. The dimension of W^k, for k ∈ {2, …, N}, is N×N.

Comment: 3. Regarding matrix multiplication, why do you choose to first reduce M×N and then continue the products with N×N, instead of doing products of M×M and then a final layer of M×N?

Response: There is no particular reason for our choice. Given Proposition 4.1, what you suggested should also work. Please note that we assume neither M>N nor M<N.
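To restate the dimensions in one place (an illustrative summary; as in our answer above, N denotes both the number of composed factors and the second weight dimension, and the ordering follows Equation (5)):

\[
\mathbf{W} \;=\; \mathbf{W}^{1}\mathbf{W}^{2}\cdots\mathbf{W}^{N}\in\mathbb{R}^{M\times N},
\qquad
\mathbf{W}^{1}\in\mathbb{R}^{M\times N},\quad
\mathbf{W}^{k}\in\mathbb{R}^{N\times N}\ (k\geq 2),
\]

while the alternative you suggest, with $\mathbf{W}^{k}\in\mathbb{R}^{M\times M}$ for $k\leq N-1$ and $\mathbf{W}^{N}\in\mathbb{R}^{M\times N}$, also yields an M×N effective weight, so either convention is compatible with Proposition 4.1.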

Comment: 4. I’m thinking that your approach could also work well in connection to the lottery ticket hypothesis. In that case, I would expect the rank to reduce even more potentially retaining a lot of the accuracy of the network.

Response: Thank you for the suggestion. We have not yet tried combining our approach (a pure compression method) with the lottery ticket hypothesis; we consider it an interesting direction for future work and will provide results/observations in the camera-ready version (if the paper is accepted).

Comment: 5. Alternatively, would it make sense to compute the linear layers using FP16 and then do the SVD at FP32? The reason is, given the limited range of the FP16 operations, that would also act as a regularization that could help the SVD as well.

Response: By "the limited range of the FP16 operations would act as a regularization", do you mean that quantization may help further reduce the rank? (Please correct us if we have misunderstood.) To see whether or not quantization helps with low-rankness, we designed the following simple experiment. We define a low-rank matrix A = LR, where L and R^T are n×r matrices whose entries are drawn from a standard Gaussian distribution. We then show how the singular values of A, a randomly pruned version of A (pruned by 90%), and a coarsely quantized representation of A decay. For a fair comparison, we normalize the spectra so that the first singular value of each of the three matrices equals 1. Experiment 1 in the response supplementary link shows the plots. We see that pruning destroys the low-rankness of A, while quantization seems to slightly enhance it. We will explore this further.
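A sketch of this synthetic comparison is below (the values of n and r and the quantization step are our own illustrative choices, not the exact ones used in Experiment 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 10
L = rng.standard_normal((n, r))
R = rng.standard_normal((r, n))           # so R.T is n x r
A = L @ R                                 # exactly rank-r matrix

# Randomly prune 90% of the entries (keep ~10%).
mask = rng.random(A.shape) > 0.9
A_pruned = A * mask

# Coarse uniform quantization as a stand-in for low-precision representation.
step = 0.5 * np.abs(A).max()
A_quant = np.round(A / step) * step

def normalized_spectrum(M):
    s = np.linalg.svd(M, compute_uv=False)
    return s / s[0]                       # first singular value normalized to 1

for name, M in [("original", A), ("90% pruned", A_pruned), ("quantized", A_quant)]:
    print(name, normalized_spectrum(M)[:15])   # inspect the decay of the spectra
```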

Reviewer EECw

Comment: 1. Prop. 4.1 and Eq. 8 suggest that as we factorize more, we promote more sparsity, since we get a p<1 norm of the matrix. However, the experiments only show the case of decompositions up to N=3. This may imply a limitation of the proposed approach or a discrepancy between theory and practice.

Response: We agree that increasing N will increase the training time, as the number of matrices required for training increases. Furthermore, tuning the hyper-parameters becomes more challenging as the depth of the network increases. However, in theory (as given in Proposition 4.1), the product of the trained linearly composed weights will always exhibit lower rank when compared to standard training (N=1). Regarding results for N>3, please see our response to Reviewer HnzA - Comment 2.

Comment: 2. Related work discusses a few other compression methods although their methodologies are a bit different from the proposed approach. However, authors failed to compare its performance with other compression methods.

Response: See our response to Reviewer 88Da - Comment 1.

Question: 1. Related work: "It’s crucial to highlight that any post-training compression technique can be applied to a DNN trained with LoRITa." Have you tried this to see if you can gain further data reduction without compromising the accuracy?

Response: Please see the second part of our response to Reviewer 1gem - Comment 3.

Question: 2. I wonder why authors call their proposed method ’the linear layers composition.’ To me, it sounds like adding more layers to DNN, i.e., increasing depth. Weight matrix decomposition or weight matrix factorization sounds better to me.

Response: That is correct: we are increasing the depth of the DNN model by appending, per weight matrix, a composition of linear layers (Equation (5)). That is the reason we use this terminology (linear layers composition). As for the suggested terms ‘decomposition’ and ‘factorization’, we agree that they could also be used.

Question: 3. After Eq. 8, "If p ∈ (0,1], minimizing the Schatten p-norm encourages low-rankness. A smaller p strengthens the promotion of sparsity by the Schatten p-norm." This is a very important argument, but citations are missing. Need some justification.

Response: We thank you for pointing this out. In the revised paper, we will include a citation. For justification, please see Section 2.2 of ‘Low-Rank Matrix Recovery via Efficient Schatten p-Norm Minimization’. We have included a screenshot in the response supplementary link.
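For the reader's convenience, the standard definition used in that argument (consistent with the cited reference) is

\[
\|\mathbf{W}\|_{S_p} \;=\; \Big(\sum_{i}\sigma_i(\mathbf{W})^{p}\Big)^{1/p},
\qquad p\in(0,1],
\]

where $\sigma_i(\mathbf{W})$ are the singular values of $\mathbf{W}$. For $p=1$ this is the nuclear norm, and as $p\to 0^{+}$ the sum $\sum_i \sigma_i(\mathbf{W})^{p}$ approaches the number of nonzero singular values, i.e., the rank, which is why a smaller $p$ promotes sparsity of the spectrum (low-rankness) more strongly.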

Question: 4. The experimental results are presented in terms of the percentage of retained SVs. I wonder if the authors also tried a threshold method to keep only SVs whose magnitudes are greater than a certain threshold. If so, can you draw some relationship between the sudden drop of accuracy and the chosen threshold?

Response: Keeping only SVs whose magnitudes are greater than a certain threshold is exactly what we did. As we vary the cutoff threshold, the percentage of retained SVs changes monotonically, so we can use either the percentage of retained SVs or the threshold value itself as the x-axis of the figure. The sudden drop in accuracy occurs when the threshold crosses one or more significant singular values.
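As a small illustration of this equivalence (our own sketch; the synthetic spectrum stands in for the singular values of a trained weight matrix):

```python
import numpy as np

def retained_fraction(S, tau):
    """Fraction of singular values kept when truncating at magnitude threshold tau."""
    S = np.sort(np.asarray(S, dtype=float))[::-1]
    return np.count_nonzero(S > tau) / S.size

# Synthetic, monotonically decaying spectrum (placeholder for real weights).
S = np.exp(-0.2 * np.arange(50))
for tau in [1e-3, 1e-2, 1e-1]:
    # A larger threshold retains a monotonically smaller fraction of the SVs.
    print(f"threshold {tau:.0e} -> retain {retained_fraction(S, tau):.0%} of the SVs")
```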

Other Comment: 13. Sect. 3.1, m is the channel index in image I. Isn’t this the channel index for OUTPUT channel?

Response: Yes. We will fix it in the revised version of the paper.

Other Comment: 14. Eq. 3, Associate Eq. 3’s weight matrices with query, key, and value in the narrative.

Response: We will revise the narrative to associate the weight matrices in Equation (3) with the query, key, and value projections of the attention layer.

Other Comment: 26. pg. 7, using LoRITa with N = 3 achieves no drop in test accuracy => There’s some drop in accuracy if you see Tbl. 1.

Response: We agree that there could be a very small drop (first three rows) or a small increase (rows 4 and 6) in accuracy, but it is very marginal. We will rephrase this sentence in the revised paper.