
Response to Reviewers’ Comments

We thank the anonymous reviewers for their helpful comments on the manuscript. These comments have helped us improve the paper. We list our answers to the reviewers’ comments below.

1 Response to Reviewer #1

Comment 1.

Algorithm 4 might need more details. In particular, it is not clear to see how to delete frequencies in (33). The paper mentions that the top $k$ frequencies of the power spectrum density for each row are preserved. I have not been able to find how to choose this $k$. It is not clear in the numerical experiments either.

Response: We thank the reviewer for this comment. In the revision, we added the variable $R$ (the number of remaining frequencies) and Equations (33) and (34) in Algorithm 4 to demonstrate how frequencies are deleted. Furthermore, below Algorithm 4, we elaborated on the process of removing frequencies.
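For the reviewer’s convenience, the following is a minimal sketch of the frequency-deletion step described above (an illustration under our own naming, not Equations (33) and (34) verbatim): for each row of $\mathbf{H}$, the $R$ frequencies with the largest power spectral density are retained and the remaining frequencies are set to zero. The function name keep_top_R_frequencies and the use of NumPy are our own choices for illustration.

\begin{verbatim}
# Illustrative sketch (not Equations (33) and (34) verbatim): keep the
# top-R frequencies of each row of H by power spectral density and
# delete (zero out) all remaining frequencies.
import numpy as np

def keep_top_R_frequencies(H, R):
    H_hat = np.fft.rfft(H, axis=1)            # row-wise DFT of H
    power = np.abs(H_hat) ** 2                # power spectral density per row
    keep = np.argsort(power, axis=1)[:, -R:]  # indices of the R largest-power frequencies
    mask = np.zeros_like(H_hat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    H_hat[~mask] = 0.0                        # delete the remaining frequencies
    return np.fft.irfft(H_hat, n=H.shape[1], axis=1)
\end{verbatim}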

Comment 2.

In both Theorem 1 and Theorem 2, the results are on the convergence to stationary points. The optimal model (7) is constrained. Could the limit points be on the boundary?

Response: Yes. There is no restriction that prevents limit points from lying on the boundary. Consider a simple example: $\mathbf{X}=(0)$, and $\mathbf{W}$ and $\mathbf{H}$ are $1\times 1$ matrices, with the loss function $f(\mathbf{H},\mathbf{W})=\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\lambda\lVert\widehat{\mathbf{H}}\rVert_{1,M}$. First, $(\mathbf{H},\mathbf{W})=(0,0)$ is a stationary point, since for any direction $(d_{1},d_{2})$,
$$f^{\prime}(0,0;(d_{1},d_{2}))=\liminf_{h\downarrow 0}\frac{f(hd_{1},hd_{2})-f(0,0)}{h}=\liminf_{h\downarrow 0}\frac{h^{2}\lvert d_{1}d_{2}\rvert+h\lambda\lvert d_{1}\rvert}{h}=\lambda\lvert d_{1}\rvert\geq 0$$
(the definition of a stationary point in [tseng2001convergence]). Now, consider the iterations of Algorithm 1 with $\mathbf{W}_{0}=(0)$, $\mathbf{H}_{0}=(1)$, and step size $\gamma_{j}=\frac{1}{2(j+1)}$,

\begin{align}
\mathbf{H}_{j+\frac{1}{2}} &= \mathbf{H}_{j}-\gamma_{j}\left(2(\mathbf{W}_{j}^{T}\mathbf{W}_{j}\mathbf{H}_{j}-\mathbf{W}_{j}^{T}\mathbf{X})+\Re\!\left(\operatorname{sign}(\Re(\widehat{\mathbf{H}_{j}}))\,\mathcal{F}_{1}^{-1}\right)\right) \\
&= \mathbf{H}_{j}-\gamma_{j}\operatorname{sign}(\mathbf{H}_{j}) \\
\mathbf{H}_{j+1} &= \max\left\{0,\ \mathbf{H}_{j+\frac{1}{2}}\right\} \\
\mathbf{W}_{j} &= \mathbf{X}\mathbf{H}_{j}^{T}(\mathbf{H}_{j}\mathbf{H}_{j}^{T})^{-1}=0 \quad \forall j.
\end{align}

From these iterations, we can check that $(\mathbf{H}_{j},\mathbf{W}_{j})$ converges to the stationary point $(0,0)$, which lies on the boundary $\partial(\operatorname{dom}f)=\partial(\mathbb{R}_{\geq 0}\times\mathbb{R})$.
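As a sanity check (our own illustration, not part of the revised manuscript), the scalar example above can be run numerically; the iterates reach the boundary point $(0,0)$ after a few steps.

\begin{verbatim}
# Scalar example from the response: X = 0, W_0 = 0, H_0 = 1, and step
# size gamma_j = 1/(2(j+1)).  The H-update reduces to
# H_{j+1} = max(0, H_j - gamma_j * sign(H_j)), and W_j stays 0 since X = 0.
import numpy as np

H, W = 1.0, 0.0
for j in range(6):
    gamma = 1.0 / (2 * (j + 1))
    H = max(0.0, H - gamma * np.sign(H))
    print(f"j={j}, H={H:.4f}, W={W}")
# H: 0.5, 0.25, 0.0833, 0.0, 0.0, 0.0  ->  (H_j, W_j) converges to (0, 0).
\end{verbatim}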

Comment 3.

In the proof of Theorem 1, more details on verifying the conditions of Lemma 1 might be needed. For instance, how the hemivariate condition is satisfied?

Response: We noticed that our proof had a minor error. While the hemivariate condition is generically satisfied, it is not always satisfied. For example, let $\mathbf{H}=\begin{pmatrix}0&0\\ 1&1\end{pmatrix}$, $\mathbf{W}_{1}=\begin{pmatrix}1&1\\ 1&1\end{pmatrix}$, and $\mathbf{W}_{2}=\begin{pmatrix}-1&1\\ -1&1\end{pmatrix}$. Consider the line segment between $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$, $\mathbf{W}_{\tau}=\tau\mathbf{W}_{1}+(1-\tau)\mathbf{W}_{2}=\begin{pmatrix}2\tau-1&1\\ 2\tau-1&1\end{pmatrix}$ for $\tau\in[0,1]$. In this case, the Frobenius norm $\lVert\mathbf{X}-\mathbf{W}_{\tau}\mathbf{H}\rVert_{F}=\lVert\mathbf{X}-\begin{pmatrix}1&1\\ 1&1\end{pmatrix}\rVert_{F}$ is constant with respect to $\tau$. Therefore, we need to impose additional conditions on $\mathbf{W}$. Note that, in this example, $\mathbf{W}\mapsto f(\mathbf{H},\mathbf{W})=\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}$ does not have a unique minimum. The hemivariate condition is introduced to ensure that the loss function has at most one minimum in each block coordinate. (Cf. if $f(x_{1},\dots,x_{N})$ is quasiconvex and hemivariate with respect to $x_{k}$, then the function $x_{k}\mapsto f(x_{1},\dots,x_{N})$ has at most one minimum; p. 483 in [tseng2001convergence].) To avoid this issue, we can add a (small) $L_{2}$-regularization term on $\mathbf{W}$. More precisely, for every $\lambda>0$, since $\frac{\partial^{2}f}{\partial\mathbf{W}^{2}}=2(\mathbf{H}\mathbf{H}^{T}+\lambda\mathbf{I})$ is positive definite, the function $\mathbf{W}\mapsto\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\lambda\lVert\mathbf{W}\rVert_{F}^{2}$ is strictly convex [grasmair2016basic] and therefore has at most one global minimum.
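The constancy of the objective along this segment can also be verified numerically; the snippet below (our own illustration) evaluates $\lVert\mathbf{X}-\mathbf{W}_{\tau}\mathbf{H}\rVert_{F}$ for several values of $\tau$, with an arbitrary choice of $\mathbf{X}$.

\begin{verbatim}
# Counterexample from the response: ||X - W_tau H||_F does not depend on
# tau, so W -> f(H, W) is constant on the segment between W_1 and W_2
# (hence not hemivariate).
import numpy as np

H  = np.array([[0., 0.], [1., 1.]])
W1 = np.array([[1., 1.], [1., 1.]])
W2 = np.array([[-1., 1.], [-1., 1.]])
X  = np.array([[2., 3.], [4., 5.]])      # any fixed X gives the same conclusion

for tau in np.linspace(0.0, 1.0, 5):
    W_tau = tau * W1 + (1 - tau) * W2    # W_tau @ H = [[1, 1], [1, 1]] for every tau
    print(tau, np.linalg.norm(X - W_tau @ H, 'fro'))
\end{verbatim}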

Revision of Lemma 1 and of the (BCD convergence) part of Theorems 1 and 2

Prompted by Reviewer #1’s Comment 3, we noticed that the proofs of BCD convergence in Theorem 1 and Theorem 2 had an error. It can be resolved by introducing (small) $L_{2}$-regularization terms for $\mathbf{W}$ and $\mathbf{W}^{\prime}$. Furthermore, we realized that we had proved that the sequence $(\mathbf{H}_{j},\mathbf{W}_{j},\mathbf{W}^{\prime}_{j})$ converges to a coordinatewise minimum (Nash equilibrium), not to a stationary point. In fact, BCD for nonsmooth block optimization does not always converge to a stationary point (see, e.g., [Auslender1976OptimisationM, p. 94]).

To address these problems, we have revised Lemma 1. We now cite other statements from [tseng2001convergence], the same paper referenced in the original Lemma 1. The propositions in [tseng2001convergence] might seem different from our new Lemma 1 at first glance. For example, [tseng2001convergence] introduces terms such as $\{x^{r}\}_{r\equiv(N-1)\,\mathrm{mod}\,N}$ under the essentially cyclic rule. In our case, we update $\mathbf{H}_{j}$, $\mathbf{W}_{j}$, and $\mathbf{W}^{\prime}_{j}$ in sequence, not randomly, which satisfies the essentially cyclic rule. Additionally, there is a difference in notation between our approach and that of [tseng2001convergence] regarding the $\{x^{r}\}_{r\equiv(N-1)\,\mathrm{mod}\,N}$ case: our single step consists of updating $\mathbf{H}$, $\mathbf{W}$, and $\mathbf{W}^{\prime}$ in sequence, whereas in [tseng2001convergence] this is counted as three steps. For example, when $r=N-1,2N-1,3N-1$, our sequence $(\mathbf{H},\mathbf{W},\mathbf{W}^{\prime})$ has been updated 1, 2, and 3 times, respectively. In summary, our Lemma 1 is mathematically consistent with the results presented in [tseng2001convergence].

For the reviewer’s convenience, we compare Lemma 1 in the initial submission and in the revision.

Old version of Lemma 1
Let $f(\theta_{1},\dots,\theta_{N})=f_{0}(\theta_{1},\dots,\theta_{N})+\sum_{j=1}^{N}f_{j}(\theta_{j})$. Suppose that $f_{0},f_{1},\dots,f_{N}$ satisfy the following:

  1. $f_{0}$ is continuous.

  2. $\theta_{j}\mapsto f(\theta_{1},\dots,\theta_{N})$ is quasiconvex and hemivariate (not constant on any line segment of its domain).

  3. $f_{0},f_{1},\dots,f_{N}$ are lower semicontinuous.

  4. $\operatorname{dom}f_{0}=Y_{1}\times\cdots\times Y_{N}$ for some $Y_{j}\subseteq\mathbb{R}^{n_{j}}$, $j=1,\dots,N$.

Then, for $\{\theta^{k}=(\theta^{k}_{1},\dots,\theta^{k}_{N})\}_{k\in\mathbb{N}}$, either $\{f(\theta^{k})\}\rightarrow-\infty$ as $k\rightarrow\infty$ or every cluster point is a coordinatewise minimum point of $f$.

New version of Lemma 1
Let $f(\theta^{1},\dots,\theta^{N})=f_{0}(\theta^{1},\dots,\theta^{N})+\sum_{k=1}^{N}f_{k}(\theta^{k})$. Suppose that $f_{0},f_{1},\dots,f_{N}$ satisfy the following:

  1. $\operatorname{dom}f_{0}$ is open, and $f_{0}$ has a directional derivative in every direction at every point of $\operatorname{dom}f_{0}$.

  2. The level set $X^{0}=\{(\theta^{1},\dots,\theta^{N}):f(\theta^{1},\dots,\theta^{N})\leq f(\theta^{1}_{0},\dots,\theta^{N}_{0})\}$ is compact, and $f$ is continuous on $X^{0}$.

  3. The mapping $\theta^{k}\mapsto f(\theta^{1},\dots,\theta^{N})$ has at most one minimum for $k\geq 2$.

Then, for the sequence $\{\theta_{j}=(\theta^{1}_{j},\dots,\theta^{N}_{j})\}_{j\in\mathbb{N}}$ generated by BCD, every cluster point is a stationary point of $f$.

Remark related to convergence to a stationary point: In our loss function, the terms involving the Frobenius norm are continuously differentiable. In other words, our loss function satisfies a stronger condition than the one assumed in the original Lemma 1. This continuous differentiability is used to show that our loss function

\begin{align}
\text{minimize}\quad & \lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\xi\lVert\mathbf{Y}-\mathbf{W}^{\prime}\mathbf{H}\rVert_{F}^{2}+\psi(\mathbf{H}) \\
\text{subject to}\quad & \mathbf{W},\mathbf{W}^{\prime}\in\mathbb{R}^{d\times r}\ \text{and}\ \mathbf{H}\in\mathbb{R}^{r\times T}_{\geq 0},
\end{align}

where

  1. (Soft frequency regularization) $\psi(\mathbf{H})=\lambda\lVert\widehat{\mathbf{H}}\rVert_{1,M}$, where $\lVert\cdot\rVert_{1,M}$ denotes the Minkowski 1-norm, or

  2. (Hard frequency regularization) $\psi(\mathbf{H})=0$ if $\widehat{\mathbf{H}}$ only uses a prespecified set of frequencies, and $\infty$ otherwise,

satisfies condition 1 in the new version of Lemma 1.
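For concreteness, the sketch below (our own illustration) evaluates the soft frequency regularizer of item 1, under the assumption that the Minkowski 1-norm $\lVert\cdot\rVert_{1,M}$ of $\widehat{\mathbf{H}}$ is the entrywise sum of the moduli of the row-wise DFT coefficients; this reading of the notation is an assumption made only for illustration.

\begin{verbatim}
# Hedged sketch of the soft frequency regularizer psi(H) = lambda * ||H_hat||_{1,M},
# reading ||.||_{1,M} as the entrywise sum of moduli of the row-wise DFT of H
# (an assumption about the notation, made only for illustration).
import numpy as np

def soft_frequency_penalty(H, lam):
    H_hat = np.fft.fft(H, axis=1)      # row-wise DFT of H
    return lam * np.abs(H_hat).sum()   # entrywise 1-norm of H_hat, scaled by lambda
\end{verbatim}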

Remark related to the hemivariate property: We consider arbitrarily fixed positive $L_{2}$-regularization terms for $\mathbf{W}$ and $\mathbf{W}^{\prime}$. This condition is used to verify conditions 2 and 3 in the new version of Lemma 1: for any positive $L_{2}$-regularization constant, the loss function becomes strictly convex in each of $\mathbf{W}$ and $\mathbf{W}^{\prime}$, so condition 3 is satisfied, and condition 2 can be readily verified using the regularization terms of $\mathbf{H}$, $\mathbf{W}$, and $\mathbf{W}^{\prime}$. In our experiments, we did not include the $L_{2}$-regularization term, but this can be regarded as setting the regularization constant to approximately 0.
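To illustrate the role of the $L_{2}$ term (a sketch under our own assumptions, not the paper’s exact update), the ridge-regularized $\mathbf{W}$-subproblem has the unique minimizer $\mathbf{W}=\mathbf{X}\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T}+\lambda\mathbf{I})^{-1}$, which is well defined for any $\lambda>0$ even when $\mathbf{H}\mathbf{H}^{T}$ is singular.

\begin{verbatim}
# Illustrative ridge-regularized W-update: the unique minimizer of
#   ||X - W H||_F^2 + lam * ||W||_F^2   over W
# is W = X H^T (H H^T + lam I)^{-1}, well defined for every lam > 0.
import numpy as np

def update_W_ridge(X, H, lam=1e-8):
    r = H.shape[0]                       # rank (number of rows of H)
    return X @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(r))
\end{verbatim}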

Finally, we have made slight adjustments to the (BCD convergence) parts of Theorems 1 and 2 to ensure that they satisfy the conditions of the new version of Lemma 1.

2 Response to Reviewer #2

Comment 1.

In this manuscript, ‘Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatio-temporal data’, the authors introduced novel methods based on Supervised low-rank Semi-Nonnegative Matrix Factorization (SSNMF). The authors propose two different optimization procedures, soft and hard regularizations, to achieve semi-nonnegative matrix factorization. Additionally, the authors prove convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. They demonstrate and compare their methods on multiple different datasets and provide very detailed analysis and explanations of these numerical experiments. The paper is well written and formulated, with great analysis, numerical results, and explanations. This manuscript should be considered for publication.

Response: We thank Reviewer #2 for the positive feedback.

3 Additional changes

  1. Throughout the paper, the iteration index has been changed from $k$ or $t$ to $j$ to avoid confusion with the indices used for frequency and time.

  2. We have placed Definition 3 right before Lemma 1. This adjustment was made because, after revising Lemma 1, we thought it would be beneficial to introduce Definition 3 earlier.

  3. In Remark 3, we have updated the notation of the loss function from $\mathcal{L}$ to $f$ for consistency with the earlier notation.

  4. In Equation (37), we corrected a typo and changed the coefficient from $\frac{1}{T+1}$ to $\frac{1}{j+1}$.