
Response to Reviewers’ Comments

We thank the anonymous reviewers for their helpful comments on the manuscript. These comments have helped us improve the paper. We list our answers to the reviewers’ comments below.

1 Response to Reviewer #1

Comment 1.

Algorithm 4 might need more details. In particular, it is not clear to see how to delete frequencies in (33). The paper mentions that the top $k$ frequencies of the power spectrum density for each row are preserved. I have not been able to find how to choose this $k$. It is not clear in the numerical experiments either.

Response: We thank the reviewer for this comment. In the revision, we added the variable $R$ (the number of remaining frequencies) and Equations (33) and (34) in Algorithm 4 to demonstrate how frequencies are deleted. Furthermore, below Algorithm 4, we elaborated on the process of removing frequencies.
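For the reviewer’s convenience, the following is a minimal sketch of the frequency-deletion step described above (an illustration under our own naming, not Equations (33) and (34) verbatim): for each row of $\mathbf{H}$, the $R$ frequencies with the largest power spectral density are retained and the remaining frequencies are set to zero. The function name keep_top_R_frequencies and the use of NumPy are our own choices for illustration.

\begin{verbatim}
# Illustrative sketch (not Equations (33) and (34) verbatim): keep the
# top-R frequencies of each row of H by power spectral density and
# delete (zero out) all remaining frequencies.
import numpy as np

def keep_top_R_frequencies(H, R):
    H_hat = np.fft.rfft(H, axis=1)            # row-wise DFT of H
    power = np.abs(H_hat) ** 2                # power spectral density per row
    keep = np.argsort(power, axis=1)[:, -R:]  # indices of the R largest-power frequencies
    mask = np.zeros_like(H_hat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    H_hat[~mask] = 0.0                        # delete the remaining frequencies
    return np.fft.irfft(H_hat, n=H.shape[1], axis=1)
\end{verbatim}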

Comment 2.

In both Theorem 1 and Theorem 2, the results are on the convergence to stationary points. The optimal model (7) is constrained. Could the limit points be on the boundary?

Response: Yes. There is no restriction that prevents limit points from lying on the boundary. Consider a simple example: $\mathbf{X}=(0)$, and $\mathbf{W}$ and $\mathbf{H}$ are $1\times 1$ matrices, with the loss function $f(\mathbf{H},\mathbf{W})=\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\lambda\lVert\widehat{\mathbf{H}}\rVert_{1,M}$. First, $(\mathbf{H},\mathbf{W})=(0,0)$ is a stationary point, since for any direction $(d_{1},d_{2})$,
$$f^{\prime}(0,0;(d_{1},d_{2}))=\liminf_{h\downarrow 0}\frac{f(hd_{1},hd_{2})-f(0,0)}{h}=\liminf_{h\downarrow 0}\frac{h^{2}\lvert d_{1}d_{2}\rvert+h\lambda\lvert d_{1}\rvert}{h}=\lambda\lvert d_{1}\rvert\geq 0$$
(the definition of a stationary point in [tseng2001convergence]). Now, consider the iterations of Algorithm 1 with $\mathbf{W}_{0}=(0)$, $\mathbf{H}_{0}=(1)$, and step size $\gamma_{j}=\frac{1}{2(j+1)}$,

\begin{align}
\mathbf{H}_{j+\frac{1}{2}} &= \mathbf{H}_{j}-\gamma_{j}\left(2(\mathbf{W}_{j}^{T}\mathbf{W}_{j}\mathbf{H}_{j}-\mathbf{W}_{j}^{T}\mathbf{X})+\Re\!\left(\operatorname{sign}(\Re(\widehat{\mathbf{H}_{j}}))\,\mathcal{F}_{1}^{-1}\right)\right) \\
&= \mathbf{H}_{j}-\gamma_{j}\operatorname{sign}(\mathbf{H}_{j}) \\
\mathbf{H}_{j+1} &= \max\left\{0,\ \mathbf{H}_{j+\frac{1}{2}}\right\} \\
\mathbf{W}_{j} &= \mathbf{X}\mathbf{H}_{j}^{T}(\mathbf{H}_{j}\mathbf{H}_{j}^{T})^{-1}=0 \quad \forall j.
\end{align}

From these iterations, we can check that $(\mathbf{H}_{j},\mathbf{W}_{j})$ converges to the stationary point $(0,0)$, which lies on the boundary $\partial(\operatorname{dom}f)=\partial(\mathbb{R}_{\geq 0}\times\mathbb{R})$.
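As a sanity check (our own illustration, not part of the revised manuscript), the scalar example above can be run numerically; the iterates reach the boundary point $(0,0)$ after a few steps.

\begin{verbatim}
# Scalar example from the response: X = 0, W_0 = 0, H_0 = 1, and step
# size gamma_j = 1/(2(j+1)).  The H-update reduces to
# H_{j+1} = max(0, H_j - gamma_j * sign(H_j)), and W_j stays 0 since X = 0.
import numpy as np

H, W = 1.0, 0.0
for j in range(6):
    gamma = 1.0 / (2 * (j + 1))
    H = max(0.0, H - gamma * np.sign(H))
    print(f"j={j}, H={H:.4f}, W={W}")
# H: 0.5, 0.25, 0.0833, 0.0, 0.0, 0.0  ->  (H_j, W_j) converges to (0, 0).
\end{verbatim}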

Comment 3.

In the proof of Theorem 1, more details on verifying the conditions of Lemma 1 might be needed. For instance, how the hemivariate condition is satisfied?

Response: We noticed that our proof had a minor error. While the hemivariate condition is generically satisfied, it is not always satisfied. For example, let $\mathbf{H}=\begin{pmatrix}0&0\\ 1&1\end{pmatrix}$, $\mathbf{W}_{1}=\begin{pmatrix}1&1\\ 1&1\end{pmatrix}$, and $\mathbf{W}_{2}=\begin{pmatrix}-1&1\\ -1&1\end{pmatrix}$. Consider the line segment between $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$, $\mathbf{W}_{\tau}=\tau\mathbf{W}_{1}+(1-\tau)\mathbf{W}_{2}=\begin{pmatrix}2\tau-1&1\\ 2\tau-1&1\end{pmatrix}$ for $\tau\in[0,1]$. In this case, the Frobenius norm $\lVert\mathbf{X}-\mathbf{W}_{\tau}\mathbf{H}\rVert_{F}=\lVert\mathbf{X}-\begin{pmatrix}1&1\\ 1&1\end{pmatrix}\rVert_{F}$ is constant with respect to $\tau$. Therefore, we need to impose additional conditions on $\mathbf{W}$. Note that, in this example, $\mathbf{W}\mapsto f(\mathbf{H},\mathbf{W})=\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}$ does not have a unique minimum. The hemivariate condition is introduced to ensure that the loss function has at most one minimum in each block coordinate. (Cf. if $f(x_{1},\dots,x_{N})$ is quasiconvex and hemivariate with respect to $x_{k}$, then the function $x_{k}\mapsto f(x_{1},\dots,x_{N})$ has at most one minimum; p. 483 in [tseng2001convergence].) To avoid this issue, we can add a (small) $L_{2}$-regularization term on $\mathbf{W}$. More precisely, for every $\lambda>0$, since $\frac{\partial^{2}f}{\partial\mathbf{W}^{2}}=2(\mathbf{H}\mathbf{H}^{T}+\lambda\mathbf{I})$ is positive definite, the function $\mathbf{W}\mapsto\lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\lambda\lVert\mathbf{W}\rVert_{F}^{2}$ is strictly convex [grasmair2016basic] and therefore has at most one global minimum.
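The constancy of the objective along this segment can also be verified numerically; the snippet below (our own illustration) evaluates $\lVert\mathbf{X}-\mathbf{W}_{\tau}\mathbf{H}\rVert_{F}$ for several values of $\tau$, with an arbitrary choice of $\mathbf{X}$.

\begin{verbatim}
# Counterexample from the response: ||X - W_tau H||_F does not depend on
# tau, so W -> f(H, W) is constant on the segment between W_1 and W_2
# (hence not hemivariate).
import numpy as np

H  = np.array([[0., 0.], [1., 1.]])
W1 = np.array([[1., 1.], [1., 1.]])
W2 = np.array([[-1., 1.], [-1., 1.]])
X  = np.array([[2., 3.], [4., 5.]])      # any fixed X gives the same conclusion

for tau in np.linspace(0.0, 1.0, 5):
    W_tau = tau * W1 + (1 - tau) * W2    # W_tau @ H = [[1, 1], [1, 1]] for every tau
    print(tau, np.linalg.norm(X - W_tau @ H, 'fro'))
\end{verbatim}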

Revision of Lemma 1 and of the (BCD convergence) part of Theorems 1 and 2

Prompted by Reviewer #1’s Comment 3, we noticed that the proofs of BCD convergence in Theorem 1 and Theorem 2 had an error. It can be resolved by introducing (small) $L_{2}$-regularization terms for $\mathbf{W}$ and $\mathbf{W}^{\prime}$. Furthermore, we realized that we had proved that the sequence $(\mathbf{H}_{j},\mathbf{W}_{j},\mathbf{W}^{\prime}_{j})$ converges to a coordinatewise minimum (Nash equilibrium), not to a stationary point. In fact, BCD for nonsmooth block optimization does not always converge to a stationary point (see, e.g., [Auslender1976OptimisationM, p. 94]).

To address these problems, we have revised Lemma 1. We now cite other statements from [tseng2001convergence], the same paper referenced in the original Lemma 1. The propositions in [tseng2001convergence] might seem different from our new Lemma 1 at first glance. For example, [tseng2001convergence] introduces terms such as $\{x^{r}\}_{r\equiv(N-1)\,\mathrm{mod}\,N}$ under the essentially cyclic rule. In our case, we update $\mathbf{H}_{j}$, $\mathbf{W}_{j}$, and $\mathbf{W}^{\prime}_{j}$ in sequence, not randomly, which satisfies the essentially cyclic rule. Additionally, there is a difference in notation between our approach and that of [tseng2001convergence] regarding the $\{x^{r}\}_{r\equiv(N-1)\,\mathrm{mod}\,N}$ case: our single step consists of updating $\mathbf{H}$, $\mathbf{W}$, and $\mathbf{W}^{\prime}$ in sequence, whereas in [tseng2001convergence] this is counted as three steps. For example, when $r=N-1,2N-1,3N-1$, our sequence $(\mathbf{H},\mathbf{W},\mathbf{W}^{\prime})$ has been updated 1, 2, and 3 times, respectively. In summary, our Lemma 1 is mathematically consistent with the results presented in [tseng2001convergence].

For the reviewer’s convenience, we compare Lemma 1 in the initial submission and in the revision.

Old version of Lemma 1
Let $f(\theta_{1},\dots,\theta_{N})=f_{0}(\theta_{1},\dots,\theta_{N})+\sum_{j=1}^{N}f_{j}(\theta_{j})$. Suppose that $f_{0},f_{1},\dots,f_{N}$ satisfy the following:

  1. $f_{0}$ is continuous.

  2. $\theta_{j}\mapsto f(\theta_{1},\dots,\theta_{N})$ is quasiconvex and hemivariate (not constant on any line segment of its domain).

  3. $f_{0},f_{1},\dots,f_{N}$ are lower semicontinuous.

  4. $\operatorname{dom}f_{0}=Y_{1}\times\cdots\times Y_{N}$ for some $Y_{j}\subseteq\mathbb{R}^{n_{j}}$, $j=1,\dots,N$.

Then, for $\{\theta^{k}=(\theta^{k}_{1},\dots,\theta^{k}_{N})\}_{k\in\mathbb{N}}$, either $\{f(\theta^{k})\}\rightarrow-\infty$ as $k\rightarrow\infty$ or every cluster point is a coordinatewise minimum point of $f$.

New version of Lemma 1
Let $f(\theta^{1},\dots,\theta^{N})=f_{0}(\theta^{1},\dots,\theta^{N})+\sum_{k=1}^{N}f_{k}(\theta^{k})$. Suppose that $f_{0},f_{1},\dots,f_{N}$ satisfy the following:

  1. $\operatorname{dom}f_{0}$ is open, and $f_{0}$ has a directional derivative in every direction at every point of $\operatorname{dom}f_{0}$.

  2. The level set $X^{0}=\{(\theta^{1},\dots,\theta^{N}):f(\theta^{1},\dots,\theta^{N})\leq f(\theta^{1}_{0},\dots,\theta^{N}_{0})\}$ is compact, and $f$ is continuous on $X^{0}$.

  3. The mapping $\theta^{k}\mapsto f(\theta^{1},\dots,\theta^{N})$ has at most one minimum for $k\geq 2$.

Then, for the sequence $\{\theta_{j}=(\theta^{1}_{j},\dots,\theta^{N}_{j})\}_{j\in\mathbb{N}}$ generated by BCD, every cluster point is a stationary point of $f$.

Remark related to convergence to a stationary point: In our loss function, the terms involving the Frobenius norm are continuously differentiable. In other words, our loss function satisfies a stronger condition than the one assumed in the original Lemma 1. This continuous differentiability is used to show that our loss function

\begin{align}
\text{minimize}\quad & \lVert\mathbf{X}-\mathbf{W}\mathbf{H}\rVert_{F}^{2}+\xi\lVert\mathbf{Y}-\mathbf{W}^{\prime}\mathbf{H}\rVert_{F}^{2}+\psi(\mathbf{H}) \\
\text{subject to}\quad & \mathbf{W},\mathbf{W}^{\prime}\in\mathbb{R}^{d\times r}\ \text{and}\ \mathbf{H}\in\mathbb{R}^{r\times T}_{\geq 0},
\end{align}

where

  1. (Soft frequency regularization) $\psi(\mathbf{H})=\lambda\lVert\widehat{\mathbf{H}}\rVert_{1,M}$, where $\lVert\cdot\rVert_{1,M}$ denotes the Minkowski 1-norm, or

  2. (Hard frequency regularization) $\psi(\mathbf{H})=0$ if $\widehat{\mathbf{H}}$ only uses a prespecified set of frequencies, and $\infty$ otherwise,

satisfies condition 1 in the new version of Lemma 1.
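For concreteness, the sketch below (our own illustration) evaluates the soft frequency regularizer of item 1, under the assumption that the Minkowski 1-norm $\lVert\cdot\rVert_{1,M}$ of $\widehat{\mathbf{H}}$ is the entrywise sum of the moduli of the row-wise DFT coefficients; this reading of the notation is an assumption made only for illustration.

\begin{verbatim}
# Hedged sketch of the soft frequency regularizer psi(H) = lambda * ||H_hat||_{1,M},
# reading ||.||_{1,M} as the entrywise sum of moduli of the row-wise DFT of H
# (an assumption about the notation, made only for illustration).
import numpy as np

def soft_frequency_penalty(H, lam):
    H_hat = np.fft.fft(H, axis=1)      # row-wise DFT of H
    return lam * np.abs(H_hat).sum()   # entrywise 1-norm of H_hat, scaled by lambda
\end{verbatim}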

Remark related to the hemivariate property: We consider arbitrarily fixed positive $L_{2}$-regularization terms for $\mathbf{W}$ and $\mathbf{W}^{\prime}$. This condition is used to verify conditions 2 and 3 in the new version of Lemma 1: for any positive $L_{2}$-regularization constant, the loss function becomes strictly convex in each of $\mathbf{W}$ and $\mathbf{W}^{\prime}$, so condition 3 is satisfied, and condition 2 can be readily verified using the regularization terms of $\mathbf{H}$, $\mathbf{W}$, and $\mathbf{W}^{\prime}$. In our experiments, we did not include the $L_{2}$-regularization term, but this can be regarded as setting the regularization constant to approximately 0.
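To illustrate the role of the $L_{2}$ term (a sketch under our own assumptions, not the paper’s exact update), the ridge-regularized $\mathbf{W}$-subproblem has the unique minimizer $\mathbf{W}=\mathbf{X}\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T}+\lambda\mathbf{I})^{-1}$, which is well defined for any $\lambda>0$ even when $\mathbf{H}\mathbf{H}^{T}$ is singular.

\begin{verbatim}
# Illustrative ridge-regularized W-update: the unique minimizer of
#   ||X - W H||_F^2 + lam * ||W||_F^2   over W
# is W = X H^T (H H^T + lam I)^{-1}, well defined for every lam > 0.
import numpy as np

def update_W_ridge(X, H, lam=1e-8):
    r = H.shape[0]                       # rank (number of rows of H)
    return X @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(r))
\end{verbatim}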

Finally, we have made slight adjustments to the (BCD convergence) parts of Theorems 1 and 2 to ensure that they satisfy the conditions of the new version of Lemma 1.

2 Response to Reviewer #2

Comment 1.

In this manuscript, ‘Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatio-temporal data’, the authors introduced novel methods based on Supervised low-rank Semi-Nonnegative Matrix Factorization (SSNMF). The authors propose two different optimization procedures, soft and hard regularizations, to achieve semi-nonnegative matrix factorization. Additionally, the authors prove convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. They demonstrate and compare their methods on multiple different datasets and provide very detailed analysis and explanations of these numerical experiments. The paper is well written and formulated, with great analysis, numerical results, and explanations. This manuscript should be considered for publication.

Response: We thank Reviewer #2 for the positive feedback.

3 Additional changes

  1. Throughout the paper, the iteration index has been changed from $k$ or $t$ to $j$ to avoid confusion with the indices used for frequency and time.

  2. We have placed Definition 3 right before Lemma 1. This adjustment was made because, after revising Lemma 1, we thought it would be beneficial to introduce Definition 3 earlier.

  3. In Remark 3, we have updated the notation of the loss function from $\mathcal{L}$ to $f$ for consistency with the earlier notation.

  4. In Equation (37), we corrected a typo and changed the coefficient from $\frac{1}{T+1}$ to $\frac{1}{j+1}$.