Review Response for IMAIAI-2022-037:
Zero-Truncated Poisson Regression for Sparse Multiway Count Data Corrupted by False Zeros
We wish to thank the reviewers and the editor once again for their input and helpful suggestions. Below are responses to each point raised by the reviewers, describing the changes made to the paper to address them. Most changes to the paper are marked in red. Note that we also changed the formatting of the equation numbering, table captions, and reference style to align with the journal's specifications.
For the convenience of the reviewers, we summarize the main changes made to the manuscript:
• The definition of κ has changed (see Eq. (1.8) in Theorem 1). This is the term describing the error amplification incurred by our zero-truncated approach (in contrast to the ideal oracle method). This change does not modify the proofs substantially and serves to provide improved error bounds and a better understanding of our proposed methodology.
• To address one of the reviewers' comments on reproducibility, we added Appendix B, which contains MATLAB code for reproducing the experiments described in Section 2. This appendix could be moved to a Supplemental Material section if the editors think that is a more appropriate place for the material.
Reviewer 1
• For the applications, authors could mention topic modeling/document classification.
Response: Citations to topic modeling, document clustering, and document classification have been added to the introduction as motivating applications that involve count data.
•
How do authors enforce the constraints in their algorithm (where is the bounded estimated tensor)? this is unclear. Or maybe it is not enforced explicitly, and only used in the theoretical part?
Response: This constraint is not enforced in the algorithm, only for the theoretical results. To make this clear to the reader, several sentences further clarifying the numerical implementation have been added to the first paragraph of Section 2.2. -
•
Although authors discuss later on in the paper the impact of the choice of , I would suggest to discuss it earlier in the paper as this is a crucial parameter in the proposed bounds. In fact, as a reader, I expected that would be close to zero, but this is not discussed much initially, and some presented results are for large ’s, e.g., Fig 1. Also, is not discussed: one has to check by itself that at , which is not obvious from the formula (as we have an indeterminacy at , ); in particular increases rather fast as goes to zero (e.g., ).
Response: We have added a more precise definition of in Eq. (1.8) and discuss the behavior of as a function of . We also note there that since it is a parameter of a Poisson distribution, thus avoiding any indeterminacy in the expression for . -
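The flavor of indeterminacy raised here can be illustrated with a standard zero-truncated Poisson quantity (a generic example, not the manuscript's definition of κ): the ZTP mean λ/(1 − e^(−λ)) is a 0/0 expression at λ = 0, yet it tends smoothly to 1 (Taylor: 1 + λ/2 + O(λ²)), and restricting to λ > 0 sidesteps the issue entirely:

```python
import math

def ztp_mean(lam):
    """Mean of a zero-truncated Poisson with rate lam: lam / (1 - exp(-lam)).
    The formula is indeterminate (0/0) at lam = 0; the limit there is 1."""
    if lam == 0.0:
        return 1.0  # limiting value from the Taylor expansion 1 + lam/2 + ...
    return lam / (1.0 - math.exp(-lam))

# The ratio approaches 1 smoothly from above as lam -> 0.
for lam in (2.0, 1.0, 0.1, 1e-3):
    print(lam, ztp_mean(lam))
```

The same resolution, stating that the rate of a Poisson distribution is strictly positive, is what the revised discussion around Eq. (1.8) relies on.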
•
Also, authors use the squared relative error which is not standard, I suppose is more, which would in some sense improve the bounds as the square root of kappa would be taken.
Response: We agree that the relative error (not squared) is more common. However, as the theory panned out, we find it cleaner to expose the result using the squared counterpart. Otherwise our bounds inevitably involve 1/2 and +-1/4 order exponents and the introductory Theorem 1 arguably becomes more difficult to digest (and the scenario worsens for the exhibition of Theorems 2 and 3). For this reason, we choose to leave the results with the squared relative errors instead for aesthetic reasons. Otherwise, the results themselves do not change and this is only a matter of exposition. Further note, that the square root of kappa is discussed in Section 2.5, since it is indeed more relevant to the relative error. -
•
Authors only consider synthetic data sets for their experiments. It could be nice to add some real-world data experiments, although the paper is mostly theoretically oriented.
Response: We focus on synthetic data in this paper, as the comparisons in the experiments require the true parameters, , to be known. In general, is unknown for real-world data; thus we do not include such results in this paper.
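To make this concrete, the synthetic setup described above (a known low-rank nonnegative parameter tensor, Poisson counts drawn from it, and error measured against the known truth) can be sketched as follows. This is a generic illustration with hypothetical names (`random_cp_tensor`, `sq_rel_err`), not the manuscript's Appendix B code:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cp_tensor(shape, rank, rng):
    """A rank-`rank` nonnegative CP tensor: sum of outer products of
    positive factor columns (a hypothetical generator for illustration)."""
    factors = [rng.uniform(0.5, 1.5, size=(n, rank)) for n in shape]
    return np.einsum('ir,jr,kr->ijk', *factors)

Lambda = random_cp_tensor((10, 10, 10), 3, rng)  # ground-truth Poisson rates
Y = rng.poisson(Lambda)                          # observed counts
# Observed zeros mix genuine Poisson zeros with any corrupted (false) zeros;
# a zero-truncated approach discards all of them and fits only Y > 0.
mask_nonzero = Y > 0

def sq_rel_err(est, truth):
    """Squared relative error, the metric used in the manuscript's bounds."""
    return np.linalg.norm(est - truth) ** 2 / np.linalg.norm(truth) ** 2
```

With the truth `Lambda` in hand, any estimate can be scored with `sq_rel_err`, which is exactly what is impossible for real-world data where the rate tensor is unobserved.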
Reviewer 2
• Difference between standard tensor completion and the suggested approach. The authors state in the abstract and the introduction that there is a significant difference between the setting considered in the paper and the setting of tensor completion. The reason is that the locations of the false zeros are not provided. However, the authors' approach is to disregard all zeros and complete the tensor from the non-zero counts. Thus, it seems that the theoretical results derived by the authors should be considered with regard to other results obtained for tensor completion, or specifically, non-negative tensor completion. This improvement is written in the final paragraph of Sec. 1, but should be clarified. Specifically, the improvement over results obtained in [19,20,21].
Response: We agree with this comment: our work falls within the category of matrix and tensor completion. To address this, we have removed most comments that stress the difference between our setting and the standard tensor completion problem (in the abstract and introduction). Furthermore, Section 1.1 has been modified to discuss this relationship more clearly. Specifically, Section 1.1 points out that our results agree with the sampling complexity of [19,20,21] (which are now citations [22,23,24]) when the CP rank is considered. Our main contribution relative to these citations is to show that their proof technique, applied to the nonnegative CP rank, provides a significant improvement in sampling complexity (no longer exponentially dependent on the number of dimensions).
•
Steps to compute a low-rank tensor. The steps suggested in the paper for obtaining the tensor are not unclear. The authors refer to the paper Genaralized CDP Decomposition of tensors that allows for low-rank completion of tensors with general loss function. Is the loss function in this paper a sum of the likelihood of individual elements? Please clarify and write the loss function explicitly. In addition, code for generating at least some results can be beneficial.
Response: We have added Appendix B that lists the MATLAB code that can be used to reproduce the results. This includes the method used for generating and the explicit definitions of the loss functions and gradients that are used in the three methods. -
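For readers of this letter, the elementwise loss in question, if it is indeed the sum of per-entry negative log-likelihoods over the observed nonzero counts, would take roughly the following form. This sketch uses the standard zero-truncated Poisson pmf; it is an illustration of the loss shape, not the manuscript's MATLAB code:

```python
import math

def ztp_nll(lams, ys):
    """Sum of negative log-likelihoods of zero-truncated Poisson counts.
    For a count y >= 1 with rate lam > 0, the ZTP pmf is
        P(Y = y | Y > 0) = exp(-lam) * lam**y / (y! * (1 - exp(-lam))),
    so each per-entry term is
        lam - y*log(lam) + log(y!) + log(1 - exp(-lam))."""
    return sum(lam - y * math.log(lam) + math.lgamma(y + 1.0)
               + math.log1p(-math.exp(-lam))
               for lam, y in zip(lams, ys))
```

In a GCP-style fitting routine this sum would run over the nonzero entries only, with the per-entry gradient 1 - y/λ + e^(-λ)/(1 - e^(-λ)) supplied alongside the loss.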
•
Numerical results. The likelihood function is nonconvex. Did you run the CDP decomposition several times to obtain better results? Or was it sufficient to run it once? The authors state that the expected parameter aligns well with the experiments. However, they don’t justify that statement with results. The experiments were run 50 times. Why not provide a standard deviation for the error? Was the ground truth tensor regenerated for every run? Please make the three curves distinguishable by using different lines/markers for each curve.
Response: We re-ran the experiments using randomly generated initial guesses for each instance of the replicates of per value of . The results were very consistent across both replicates and initial guesses. We have updated Figure 2 and discussions in Sections 2.4 and 2.6 to convey the consistency of the results. The plots of average relative errors now include lines for the bound of multiplied by the average error values of the Oracle method to illustrate that the average errors (+/- one standard deviation) for the ZTP method all fall below the theoretical bound. In addition to the different colors used in the originally submitted plots, we have also added markers for each value of in these plots and used different plot lines for the different methods to help distinguish between the results for the different methods. -
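The multi-start strategy described in this response is a standard hedge against nonconvexity and can be sketched generically. The objective and solver below are toy stand-ins (a quartic with two basins and plain gradient descent), not the manuscript's CP fitting code:

```python
import random

def multi_start_minimize(objective, random_init, local_solve, n_starts=10, seed=0):
    """Run a local solver from several random initial guesses and keep the
    best result found -- the usual remedy for nonconvex objectives."""
    rng = random.Random(seed)
    best_x, best_f = None, float('inf')
    for _ in range(n_starts):
        x = local_solve(objective, random_init(rng))
        f = objective(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Toy 1-D nonconvex objective with two basins; the +0.1*x tilt makes the
# basin near x = -1 the global one.
obj = lambda x: (x * x - 1.0) ** 2 + 0.1 * x
init = lambda rng: rng.uniform(-2.0, 2.0)

def grad_descent(f, x, lr=0.01, steps=2000):
    h = 1e-6
    for _ in range(steps):
        g = (f(x + h) - f(x - h)) / (2 * h)  # central-difference gradient
        x -= lr * g
    return x

x_best, f_best = multi_start_minimize(obj, init, grad_descent)
```

If all restarts land in the same basin, as the response reports for the actual experiments, the returned optima agree closely, which is one way to "convey the consistency of the results."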
•
Evaluation of . This parameter is significant, as it bounds the error accumulated by disregarding all the zeros in the tensor. The full expression, as presented in Eq. (7) is not very informative. It would be useful to understand, for example, the dependence on (or sparsity level) when (even a simple Taylor expansion may suffice for this).
Response: We have added a more precise definition of in Eq. (1.8) and discuss the behavior of as a function of . We also note there that since it is a parameter of a Poisson distribution, thus avoiding any indeterminacy in the expression for . -
•
Proof of Theorem 1. This is the main result in the paper, but its proof is written off-hand, and is almost impossible to follow and verify. You should have designated proof. Specifically, I would like a clear explanation of the different roles of Theorem 2 and 3 in obtaining Theorem 1.
Response: Apologies for the unclear and off-hand proof of Theorem 1. The manuscript now includes an explicit proof of Theorem 1 (see Section 3, before subsection 3.1). Furthermore, Section 3 also includes an introductory discussion explaining the roles of Theorems 2 and 3 in the proof of Theorem 1. -
•
Exposition for Theorems 2 and 3. Please state the purpose of each theorem. Make the difference between them clear to the reader.
Response: To aid the reader in distinguishing Theorems 2 and 3 and their purpose in the overall work (and statement of Theorem 1), Section 3 now includes an introductory paragraph and paragraphs before Theorem 2 and 3. These paragraphs elaborate details and the purpose of each result, tying back to Section 1 to remind the reader of the overall context. -
•
Proof of Theorem 2. The proof seems interesting, but is very hard to follow. A paragraph stating the different steps or building blocks of the proof (Lemma 5 shows, etc.) would improve the paper.
Response: To help the reader understand the proof of Theorem 2, Section 3.1 now includes a brief discussion of the proof steps, intuition, and roles of the lemmas at the beginning. Furthermore, additional sentences were added to reiterate these steps before and after the statement of Lemmas 4 and 5, and as the proof of Theorem 2 proceeds. -
•
Proof of Theorem 2 - concentration. One of the components of the proof of lemma 5 is Markov’s inequality, which is applied to the supremum of the absolute difference between the likelihood function and its expectation. This results in a fairly slow decay of in the probability of a large deviation. However, it seems that stronger bounds can be applied since the difference is between the value of a smooth function of multiple independent random variables and its expectation. I would expect an exponential decay in such cases. (as was obtained, for example, in [21]). Alternatively, if this is not the case, explain why.
Response: This is a very insightful observation, thank you for pointing this out. This difference in the probability decay is due to the non-subgaussian nature of our Poisson noise. To elaborate, we refer to the book “High-Dimensional Probability” by Roman Vershynin where subgaussian random variables are defined. Examples of subgaussian random variables include Gaussian random variables and these type of random variables produce exponentially decaying probability in large deviations of a smooth function of multiple independent random variables from its expectation. The authors in [21] (now citation [24]) consider Gaussian noise and are able to obtain such strong decay rates. On the other hand, Poisson random variables are not subgaussian and as a consequence we cannot achieve such decay in the probability of our results. The same is true for the work in our references [13] and [23] (which are the same authors from [24] extending their results therein to non-subgaussian type noise). -
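The distinction in this response can be seen directly from moment generating functions. A σ²-subgaussian variable satisfies log E[exp(s(X − EX))] ≤ σ²s²/2, a quadratic in s, while a centered Poisson(λ) variable has log-MGF λ(e^s − 1 − s), which outgrows any quadratic. A small check of these two standard formulas (generic probability facts, not the manuscript's derivation):

```python
import math

def poisson_centered_log_mgf(lam, s):
    """log E[exp(s*(X - lam))] for X ~ Poisson(lam): lam * (e^s - 1 - s)."""
    return lam * (math.exp(s) - 1.0 - s)

def subgaussian_log_mgf_bound(sigma2, s):
    """The quadratic bound every sigma^2-subgaussian variable must satisfy."""
    return sigma2 * s * s / 2.0

# For any fixed sigma^2, the Poisson log-MGF eventually exceeds the
# quadratic bound, so Poisson noise admits no subgaussian parameter and
# Chernoff-type exponential deviation bounds do not apply uniformly.
lam, sigma2 = 1.0, 10.0
for s in (1.0, 5.0, 10.0):
    print(s, poisson_centered_log_mgf(lam, s), subgaussian_log_mgf_bound(sigma2, s))
```

This is why the Gaussian-noise analysis of [24] yields exponential tails while the Poisson setting cannot, matching the response above.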
•
Notations ,. Please avoid using the regular , for natural and real numbers, as I assume you did in Theorem 2,3, Lemma 5 etc. especially since you used them for the tensor rank and dimensionality, and this might be confusing. Use instead.
Response: This was a mistake when using\mathbb
, thank you for catching this. We have fixed this issue and now and are only used for dimension and rank (resp.) while and are for the sets of natural and real numbers (resp.).