
Responses to Reviewers’ Comments

We thank the reviewers for their comments. We have addressed all of them to the best of our ability in the revised manuscript; major textual changes are highlighted in blue.

Reply to Associate Editor

Comment:

“With all due respect, it seems that the authors are confused by the mathematical terminology. Namely, different results used in ANY paper can be called Facts / Claims / Propositions / Lemmas (Lemmata) and, finally, Theorems. These terms all refer to TRUE and (maybe) IMPORTANT results but they play different roles. Namely, any result that is called a Theorem without name of person like Stokes’ Theorem or citation is by default claimed to be proven by the authors and to be the main or one on the main contributions of the article. If the authors claim that Theorem 1 is the main result and it has never been known before then the paper must be rejected. This result is definitely known. The authors might look it up in any basic Linear Algebra course. However, if the authors just want to mention it and use as an auxiliary result, they have to call it a Fact/Property/Lemma. For example, Schur’s Lemma is an extremely important and even central result in representation theory but everybody calls it lemma. From here we see that being a lemma doesn’t mean the result is unimportant.”

I think the critique is justified. The proof has two parts. The first part is indeed a standard result: the graph Laplacian is PSD, and conjugation of a PSD matrix gives a PSD matrix. The second part, “GGLR $\text{Pr}(\alpha)$ in (22) promotes planar signal reconstruction”, is not a proper mathematical statement since “promotes” is a loose term without a proper definition. My suggestion is to separate this into two parts; cite the first part as a Proposition/Lemma and cite a book/paper where this can be found; finally, bring in the second part as a separate remark (or alternatively give a proper mathematical statement that does not rely on “promotes…”).

Response: We thank the AE for the summary comments. For the first part of Theorem 1, we have added a reference to a standard linear algebra book by Golub and Van Loan [Golub2013MatrixC]. For the second part of Theorem 1, we have already provided a definition of “promote” in footnote 1 in the previous version. In this revised version, we further clarified this terminology in the same footnote 1 as:

By “promote”, we mean adding a non-negative signal prior $\text{Pr}(\mathbf{x})$ to a minimization objective $f(\mathbf{x})$ to bias the solution $\mathbf{x}$ towards signals where $\text{Pr}(\mathbf{x})=0$. See [13] for a comprehensive overview of bias-variance tradeoffs.

As argued similarly in our response to the AE in the previous review round, we stress here again that the bias-variance tradeoff is a very well-studied topic in the machine learning / signal processing literature [Belkin2018ReconcilingMM]. The claim that using “promote” and “bias” in the context of a minimization problem is not a “proper mathematical statement” is entirely inconsistent with the existing literature [yankelevsky16], [mahmood18], [liu19depth], [cheung21].

To further clarify what we mean by “promote” in the special case when the prior is our proposed GGLR, we reworded Theorem 1 as:

Gradient-induced nodal graph (GNG) Laplacian $\mathcal{L}$ is PSD with all planar signals as the 0-frequencies, and thus GGLR $\mathbf{x}^{\top}\mathcal{L}\mathbf{x}$ in (22) promotes planar signal reconstruction, i.e., planar signals $\mathbf{x}$ compute to $\mathbf{x}^{\top}\mathcal{L}\mathbf{x}=0$.

In our specific case, “promote” here means biasing the solution towards planar signals that are 0-frequencies of the GNG Laplacian matrix $\mathcal{L}$. We sincerely hope that these revisions are up to the AE’s personal standard of mathematical precision.
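For concreteness, below is a minimal numerical sketch (our own illustration, not taken from the manuscript) of what “promote” / “bias” means in a MAP objective: adding a quadratic prior $\mu\,\mathbf{x}^{\top}\mathbf{L}\mathbf{x}$ to a least-squares fidelity term pulls the solution towards the prior’s 0-frequencies as $\mu$ grows. A path-graph Laplacian is used here as a simple stand-in for the GNG Laplacian $\mathcal{L}$ defined in the manuscript.

import numpy as np

# Path graph on N nodes; its combinatorial Laplacian L = D - W is PSD,
# with constant vectors as its only 0-frequencies (stand-in for the GNG Laplacian).
N = 6
W = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
L = np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
y = np.ones(N) + 0.5 * rng.standard_normal(N)      # noisy observation of a constant signal

for mu in [0.0, 1.0, 100.0]:
    # MAP estimate: argmin_x ||y - x||_2^2 + mu * x^T L x  =>  (I + mu L) x = y
    x_hat = np.linalg.solve(np.eye(N) + mu * L, y)
    # The prior value x_hat^T L x_hat shrinks towards 0 as mu grows, i.e., the
    # solution is increasingly biased towards signals where the prior evaluates to 0.
    print(mu, x_hat.round(3), round(float(x_hat @ L @ x_hat), 6))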

Lastly, we humbly ask the AE to read not only the reviewers’ comments but also our arduously composed responses to the reviewers, carefully and thoroughly, before rendering an editorial decision. We thank the AE for his/her tremendous effort throughout the multiple review rounds.

Reply to Reviewer 2

Comment 1:

With all due respect, it seems that the authors are confused by the mathematical terminology. Namely, different results used in ANY paper can be called Facts/ Claims/ Propositions/ Lemmas(Lemmata) and, finally, Theorems. These terms all refer to TRUE and (maybe) IMPORTANT results but they play different roles. Namely, any result that is called a Theorem without name of person like Stokes’ Theorem or citation is by default claimed to be proven by the authors and to be the main or one on the main contributions of the article. If the authors claim that Theorem 1 is the main result and it has never been known before then the paper must be rejected. This result is definitely known. The authors might look it up in any basic Linear Algebra course. However, if the authors just want to mention it and use as an auxiliary result, they have to call it a Fact/Property/Lemma. For example, Schur’s Lemma is an extremely important and even central result in representation theory but everybody calls it lemma. From here we see that being a lemma doesn’t mean the result is unimportant.

If the authors insists that Theorem 1 is their theorem (they never cite a source) then the paper must be rejected.

Response 1: With due respect to the reviewer, we are not sure this discussion is productive in the context of a paper review; we should be discussing the content of lemmas and theorems rather than the definitions of lemmas and theorems. Nonetheless, for the sake of completeness, we have looked up the Merriam-Webster dictionary, and “lemma” is defined as

An auxiliary proposition used in the demonstration of another proposition (https://www.merriam-webster.com/dictionary/lemma).

In contrast, a “theorem” is defined as

A formula, proposition, or statement in mathematics or logic deduced or to be deduced from other formulas or propositions (https://www.merriam-webster.com/dictionary/theorem).

These definitions are consistent with our understanding and usage in our manuscript; namely, Lemmas 1 and 2 are foundational pieces that we introduce first, from which Theorem 1 is later built.

The reviewer appears to focus exclusively on the first part of Theorem 1 stating that the GNG Laplacian is positive semi-definite (PSD) (note that a reference to a standard linear algebra book by Golub and Van Loan [Golub2013MatrixC] is added). However, the second part of the Theorem stating that planar signals are the 0 frequencies of the GNG Laplacian—and thus the GGLR signal prior is promoting (biasing the solution towards) planar signal reconstruction—is entirely neglected. If the reviewer insists this part is also known in the literature, we humbly ask the reviewer to provide a reference.

Finally, we note again that through multiple rounds of reviews, we have received technical comments from this reviewer only about Theorem 1, while the vast majority of the paper has received no comments. We humbly ask the reviewer again to read the entire manuscript and provide a balanced critique of the whole work.

Comment 2: For the Ridge Regression issue, i am pasting here a link to Wikipedia page https://en.wikipedia.org/wiki/Ridge_regression. The title of this link speaks for itself. This page among other facts and concepts is talking about EXACTLY your situation. They call it Tikhonov regression and it might be viewed as a type or close relative of Ridge Regression.

In any case, if the authors insist that they wont cite the source and want still to call it a Theorem, the paper must be rejected.

Response 2: We first humbly remind the reviewer that, in the first review round, the reviewer wrote that our equation (31) of Theorem 3 is the same as “ridge regression”. As we have already explained in our response 3 to reviewer 2 in the first round, Theorem 3 (equation (1) below) is not the same as ridge regression (equation (2) below):

$$\mathbf{x}^{*} = \left(\mathbf{H}^{\top}\mathbf{H} + \mu\mathcal{L}\right)^{-1}\mathbf{H}^{\top}\mathbf{y} \quad (1)$$
$$\hat{\beta}_{R} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}. \quad (2)$$

As anyone can plainly see, the two equations are not the same, since the GNG Laplacian matrix $\mathcal{L}$ in (1) (which is positive semi-definite (PSD)) is not the identity matrix $\mathbf{I}$ in (2) (which is positive definite (PD)). The important implication is that the coefficient matrix $\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I}$ in (2) is guaranteed to be PD and invertible, while $\mathbf{H}^{\top}\mathbf{H}+\mu\mathcal{L}$ is not. This necessitates Theorem 3 and its proof in Appendix C, on which we have received no comments from the reviewer in two review rounds.
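To make the distinction concrete, here is a small numerical sketch (our own illustration, under the assumption that $\mathbf{H}$ is rank-deficient and shares a null vector with $\mathcal{L}$; a first-difference operator and a path-graph Laplacian are used as stand-ins): the ridge-style coefficient matrix is always PD and invertible, while the GGLR-style coefficient matrix need not be.

import numpy as np

N = 5
# Combinatorial Laplacian of a path graph (PSD; null space = constant vectors)
W = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
L = np.diag(W.sum(axis=1)) - W

# Rank-deficient observation matrix H (first differences) that also annihilates
# constant signals, so H^T H + mu*L shares the constant vector in its null space.
H = np.eye(N)[1:] - np.eye(N)[:-1]

mu, lam = 1.0, 1.0
A_gglr  = H.T @ H + mu * L              # only PSD: can be singular
A_ridge = H.T @ H + lam * np.eye(N)     # always PD: ridge-style coefficient matrix

print(np.linalg.matrix_rank(A_gglr), N)     # N-1 < N: not invertible
print(np.linalg.matrix_rank(A_ridge), N)    # N: invertible
print(np.linalg.eigvalsh(A_ridge).min())    # smallest eigenvalue >= lam > 0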

In this review round, the reviewer now doubled down and stated that our Theorem 3 is the same as “Tikhonov regression”. First, there is no such thing as “Tikhonov regression”; we conjecture that the reviewer means “Tikhonov regularization” instead. Second, it is known that the reviewer’s cited Wiki-page is incorrect in stating that “ridge regression” is the same as “Tikhonov regularization” (see a discussion of this Wiki-page error at https://stats.meta.stackexchange.com/questions/4305/is-tikhonov-regularization-really-a-synonym-for-ridge-regression). Instead, Tikhonov regularization is

$$\hat{\mathbf{x}} = \left(\mathbf{A}^{\top}\mathbf{A} + \bm{\Gamma}^{\top}\bm{\Gamma}\right)^{-1}\mathbf{A}^{\top}\mathbf{b}. \quad (3)$$

Thus, Tikhonov regularization is a super-set that includes ridge regression as a special case when matrix $\bm{\Gamma}=\mathbf{I}$ (see equation (2.1) in [Hoerl2000RidgeRB] for ridge regression and equation (3) in [Golub1999TikhonovRA] for Tikhonov regularization). Decomposing the PSD GNG Laplacian $\mathcal{L}$ into $\mathcal{L}=\bm{\Gamma}^{\top}\bm{\Gamma}$, the solution in Theorem 3, equation (1) above, can indeed be viewed as a special case of Tikhonov regularization. But so is the previously proposed graph Laplacian regularizer (GLR) [pang17], $\mathbf{x}^{\top}\mathbf{L}\mathbf{x}$, the graph shift variation (GSV) [romano17], $\|(\mathbf{I}-\mathbf{A})\mathbf{x}\|^{2}_{2}$, the left eigenvectors of the random walk graph Laplacian (LeRaG) [liu17], $\mathbf{x}^{\top}\mathbf{L}_{rw}^{\top}\mathbf{L}_{rw}\mathbf{x}$ (where $\mathbf{L}_{rw}$ is the random walk Laplacian), and so on. Each of these specialized Tikhonov regularizations has its own spectral properties, and thus promotes (biases the solution towards) a different signal reconstruction. So, what exactly is the problem here? In our case, we proved in Theorem 1 that GGLR promotes (biases the solution towards) planar signal reconstruction (i.e., planar signals are the 0-frequencies of the GNG Laplacian $\mathcal{L}$). We also provided a detailed analysis in Theorem 3 specifying precisely under what conditions matrix $\mathbf{H}^{\top}\mathbf{H}+\mu\mathcal{L}$ is full-rank and invertible. To the best of our knowledge, these theorems do not previously exist in the literature. Instead of an unreliable Wiki-page, we humbly ask the reviewer for a precise reference if we are incorrect.
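As a small supporting sketch (our own illustration, not from the manuscript), any PSD regularizer $\mathbf{x}^{\top}\mathbf{L}\mathbf{x}$ can be written in Tikhonov form $\|\bm{\Gamma}\mathbf{x}\|_{2}^{2}$ by taking $\bm{\Gamma}$ as a matrix square root of $\mathbf{L}$; a path-graph Laplacian is used below in place of the GNG Laplacian.

import numpy as np

# Any PSD matrix L factors as L = Gamma^T Gamma (e.g., via its eigendecomposition),
# so x^T L x = ||Gamma x||_2^2 is a specialized Tikhonov regularizer (Gamma = I gives ridge).
N = 6
W = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
L = np.diag(W.sum(axis=1)) - W                  # PSD graph Laplacian used as an example

evals, V = np.linalg.eigh(L)
Gamma = V @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ V.T   # L = Gamma^T Gamma

x = np.random.default_rng(1).standard_normal(N)
print(np.allclose(x @ L @ x, np.linalg.norm(Gamma @ x) ** 2))   # True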

Reply to Reviewer 3

Comment 1: I thank the authors for the revised manuscripts where they have attempted to answer many of my comments and questions. The manuscript is much more easy to follow, and specially some of the definitions are more explicit than before.

Response 1: We thank the reviewer for the positive comment.

Comment 2: The authors now provide a definition of a K-dimensional planar signal on page 4. However, we are not still told what the relation between the the different $\alpha$’s of the plane can be, specially in connection to it being a graph signal.

More importantly, we are led to next the definition of a piecewise planar signal, which forms the crux of the work. However, it still remains unclear what a general piece-wise planar signal means. Yes, the authors have provided an example of a 1D linear signal. But how this extends to even the next dimension is unclear. Basically, what defined a notion of piecewise planar - what is meant by ‘adjacent sub-domains’? The authors state ‘If one further assumes signal …then the two adjunct linear pieces must coincide’. How is adjacency defined, even qualitatively? And what does it mean to coincide? Please consider clarifying these concepts.

Response 2a: Putting aside the piecewise planar notion and focusing on the simpler planar notion, it appears that the reviewer is already confused between a continuous hyperplane and the discrete samples on a hyperplane that compose a graph signal. Specifically, in the previous manuscript, we provided a definition of a continuous $K$-dimensional hyperplane (equation (6) in the revised manuscript), which is common in the literature. The $\alpha$’s are the standard parameters that define the hyperplane, like the slope $m$ and $y$-intercept $b$ for a one-dimensional line $y=mx+b$; the parameter vector $\bm{\alpha}$ is the manifold gradient. A collection of discrete samples residing on the same hyperplane is a planar discrete signal; connecting neighboring samples via edges results in a planar graph signal. This is very clearly illustrated in Fig. 3(a) in the revised manuscript, where there are four discrete samples on the same 2D plane. This seems rather straightforward, and we fail to understand the reviewer’s confusion.

Response 2b: The notion of neighboring sub-domains or sub-graphs is common in the graph literature—see, for example, the nodal domain theorem [Bykoglu2005NodalDT]. For the sake of completeness, we provide a definition here and in the revised manuscript. In our context, we define a sub-domain $\mathcal{N}_{m}$ as a maximally connected subset of nodes $i$ such that the corresponding discrete samples $x_{i}$ reside on the same hyperplane. We assume that nodal domains are non-overlapping. Two nodal domains $\mathcal{N}_{m}$ and $\mathcal{N}_{n}$ are adjacent if there exists at least one node $i\in\mathcal{N}_{n}$ and one node $j\in\mathcal{N}_{m}$ such that $(i,j)\in\mathcal{E}$. We fail to understand the reviewer’s confusion, since the piecewise constant (PWC) graph signal—where samples in each sub-domain evaluate to the same constant—is well discussed in the GSP literature [pang17, liu17], and the PWP graph signal—where samples in each sub-domain reside on the same hyperplane—is just a natural extension.
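For illustration only, here is a minimal sketch of the adjacency definition above (the node sets and edge list are hypothetical examples): two nodal domains are adjacent exactly when at least one graph edge connects them.

def domains_adjacent(N_m, N_n, edges):
    """Two nodal domains are adjacent if at least one graph edge connects them."""
    N_m, N_n = set(N_m), set(N_n)
    return any((i in N_m and j in N_n) or (i in N_n and j in N_m) for (i, j) in edges)

# Toy path graph 0-1-2-3-4-5 partitioned into two sub-domains sharing edge (2, 3)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(domains_adjacent({0, 1, 2}, {3, 4, 5}, edges))   # True
print(domains_adjacent({0, 1}, {4, 5}, edges))         # False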

Comment 3: In Response to my comment 3, the authors attempt to demonstrate ‘the fallacy of the argument’ - I am not sure what the fallacy is referred to here. Also, the authors demonstrate that $(\mathbf{x}-\mathbf{A}\mathbf{x})^{\top}\mathbf{L}(\mathbf{x}-\mathbf{A}\mathbf{x})$ is not small by considering a particular example of a pathological graph which is a line graph whose starting and end nodes have only one connection, the rest all have two neighbours. The entire arguments falls short if this is instead considered for a path graph or a ring graph. So it is not right to say that my argument is entirely fallacious.

Further, the authors choose a particular signal x=[2, 0, -1], instead if you consider a graph of larger number of nodes, but the same linear signal, say running from -2 to 2 in steps of 100 (100 nodes), the measured quantity again will be zero only at the two extreme nodes, and in this is similar to the edge-effects well known in discrete signal processing. I do not think this refutes the validity of my suggestion of using $(\mathbf{x}-\mathbf{A}\mathbf{x})^{\top}\mathbf{L}(\mathbf{x}-\mathbf{A}\mathbf{x})$ as a meaningful way to enforce smoothness of the derivative or the residual leading to piecewise planar regularisation.

Further in the response 3, the authors mention ‘which agrees with our intuition for a linear line’ What is a linear line?

Also they state ‘The reason we can compute a more reasonable regularizer for a linear line is because we computed manifold gradients w.r.t. latent coordinates that are defined globally’ - this was not clear - is it reasonable because you used latent coordinates that had nothing to do with the actual graph?

Again, after all the newer additions, the motivation for the PWP and indeed the formulation itself seems ill-posed.

Response 3a: First, we are puzzled that the reviewer continues to insist on $(\mathbf{x}-\mathbf{A}\mathbf{x})^{\top}\mathbf{L}(\mathbf{x}-\mathbf{A}\mathbf{x})$ as a reasonable regularizer to promote some notion of graph linearity when a counter-example was already provided in our previous response. A theorem is proven false when a single counter-example is given; counter-intuitively, the reviewer seems content with a proposal that may work only part of the time (and it remains unclear when it would work and to what extent in general). In particular, what constitutes an “extreme node” in a general graph is very unclear: given a finite graph with nodes of varying degrees, what are the “extreme nodes” according to the reviewer? For a general graph then, how does the reviewer propose to account for “edge-effects” without first correctly identifying the so-called “extreme nodes”? This is hardly sound algorithm design as an engineering principle. In contrast, our GGLR proposal to promote planar / PWP graph signals, based on fast computation of a graph embedding to obtain latent coordinates for each node, does not require identification of “extreme nodes” and works every time.

Moreover, in the previous revised manuscript we have already demonstrated definitively why using $(\mathbf{L}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}\mathbf{x})$ to promote signal linearity, where $\mathbf{L}\mathbf{x}$ is used to compute graph gradients, is not reasonable. The reviewer’s suggestion of $\mathbf{x}^{\top}(\mathbf{I}-\mathbf{A})^{\top}\mathbf{L}(\mathbf{I}-\mathbf{A})\mathbf{x}=(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$, where normalized graph Laplacian $\mathbf{L}_{n}=\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}$ is used to compute graph gradient $\mathbf{L}_{n}\mathbf{x}$ instead of $\mathbf{L}\mathbf{x}$, is just a normalized variant of the same idea. We have already indicated in a footnote (footnote 9 in the revised manuscript) that a similar argument can be made for other versions of graph Laplacians. Nonetheless, for the sake of completeness, we now show in detail why this normalized variant is also not reasonable. (This is included in Appendix A in the revised manuscript.)

We begin by showing that the only signal that computes to $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})=0$ is $\mathbf{x}=\mathbf{v}_{1}=\text{diag}(\mathbf{D}^{1/2})$, where $\mathbf{v}_{1}$ is the (unnormalized) first eigenvector corresponding to eigenvalue 0 of $\mathbf{L}_{n}$. First, it is obvious that $(\mathbf{L}_{n}\mathbf{v}_{1})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{v}_{1})=0$, since by definition of eigenvector $\mathbf{v}_{1}$ corresponding to eigenvalue $\lambda_{1}=0$, $\mathbf{L}_{n}\mathbf{v}_{1}=\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}\text{diag}(\mathbf{D}^{1/2})=\mathbf{D}^{-1/2}\mathbf{L}\mathbf{1}=\mathbf{0}$. Second, in order for $\mathbf{y}^{\top}\mathbf{L}\mathbf{y}=0$ where $\mathbf{y}\neq\mathbf{0}$, $\mathbf{y}$ must be the constant signal $c\mathbf{1}$, which is the (unnormalized) first eigenvector corresponding to eigenvalue 0 of $\mathbf{L}$. In general, any vector $\mathbf{x}=\sum_{n=1}^{N}\alpha_{n}\mathbf{v}_{n}$ can be expressed as a linear combination of eigenvectors $\{\mathbf{v}_{n}\}$ of $\mathbf{L}_{n}$, which are orthogonal basis vectors spanning the signal space $\mathcal{H}=\mathbb{R}^{N}$, where $\alpha_{i}=\langle\mathbf{x},\mathbf{v}_{i}\rangle=\mathbf{v}_{i}^{\top}\mathbf{x}$. We now write

$$\mathbf{L}_{n}\mathbf{x} \stackrel{(a)}{=} \left(\sum_{i=2}^{N}\lambda_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{\top}\right)\left(\sum_{n=1}^{N}\alpha_{n}\mathbf{v}_{n}\right) \quad (4)$$
$$\stackrel{(b)}{=} \sum_{i=2}^{N}\lambda_{i}\alpha_{i}\mathbf{v}_{i} \quad (5)$$

where in $(a)$ we write the symmetric matrix $\mathbf{L}_{n}$ as a linear combination of rank-1 matrices that are outer-products $\mathbf{v}_{i}\mathbf{v}_{i}^{\top}$ of its eigenvectors $\mathbf{v}_{i}$, each scaled by eigenvalue $\lambda_{i}$. $(b)$ follows since the eigenvectors of symmetric matrix $\mathbf{L}_{n}$ are orthogonal, i.e., $\mathbf{v}_{i}^{\top}\mathbf{v}_{j}=\delta_{i-j}$. Note, importantly, that there is no $\mathbf{v}_{1}\mathbf{v}_{1}^{\top}$ term in the summation, since $\lambda_{1}=0$.

Thus, in order for $\mathbf{y}^{\top}\mathbf{L}\mathbf{y}=0$, $\mathbf{y}$ must equal $c\mathbf{1}=\mathbf{L}_{n}\mathbf{x}=\sum_{i=2}^{N}\lambda_{i}\alpha_{i}\mathbf{v}_{i}$. However, we see that the inner product $\langle c\mathbf{1},\mathbf{v}_{1}\rangle=\mathbf{v}_{1}^{\top}c\mathbf{1}=c\sum_{i}v_{1,i}=c\sum_{i}d_{i}^{1/2}$, where $d_{i}>0$ is the degree of node $i$ for a positive connected graph $\mathcal{G}$. Thus, $\sum_{i}d_{i}^{1/2}>0$, and $c\sum_{i}d_{i}^{1/2}\neq 0$ for $c\neq 0$. Given $\langle c\mathbf{1},\mathbf{v}_{1}\rangle\neq 0$, the constant signal $c\mathbf{1}$ cannot be written as $c\mathbf{1}=\sum_{i=2}^{N}\lambda_{i}\alpha_{i}\mathbf{v}_{i}$, except for the case when $c=0$ and $\alpha_{i}=0,\forall i$. We can therefore conclude that the only signal $\mathbf{x}$ such that $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})=0$ is $\mathbf{x}=\mathbf{v}_{1}$, the first eigenvector of $\mathbf{L}_{n}$. However, this means that the dominant signal (lowest frequency) that $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$ promotes is exactly the same as that promoted by $\mathbf{x}^{\top}\mathbf{L}_{n}\mathbf{x}$. Hence, the reviewer’s suggested $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$ promotes signal reconstruction that is essentially no different from the existing GLR $\mathbf{x}^{\top}\mathbf{L}_{n}\mathbf{x}$ using the normalized graph Laplacian $\mathbf{L}_{n}$.

Further, it is clear that even the constant signal $\mathbf{x}=c\mathbf{1}$, which has zero gradient and is thus smooth and linear under any reasonable definition of signal linearity, does not compute to zero, i.e., $(\mathbf{L}_{n}c\mathbf{1})^{\top}\mathbf{L}(\mathbf{L}_{n}c\mathbf{1})\neq 0$. We can thus conclude that $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$ is not a reasonable regularizer to promote graph signal linearity.
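The argument above can also be checked numerically; below is a small sketch (our own verification, on an arbitrary connected irregular graph) confirming that the quadratic form $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})=\mathbf{x}^{\top}\mathbf{L}_{n}\mathbf{L}\mathbf{L}_{n}\mathbf{x}$ has a one-dimensional null space spanned by $\text{diag}(\mathbf{D}^{1/2})$, and that the constant signal is penalized.

import numpy as np

# Small connected, irregular graph: star center 0 linked to 1, 2, 3, plus edge (3, 4)
N = 5
W = np.zeros((N, N))
for i, j in [(0, 1), (0, 2), (0, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
D = W.sum(axis=1)
L = np.diag(D) - W                                 # combinatorial Laplacian
Ln = np.diag(D ** -0.5) @ L @ np.diag(D ** -0.5)   # normalized Laplacian

M = Ln @ L @ Ln                                    # quadratic form of the suggested prior
evals = np.linalg.eigvalsh(M)
print(int(np.sum(evals < 1e-10)))                  # 1: null space is one-dimensional

v1 = np.sqrt(D)                                    # diag(D^{1/2})
print(round(float(v1 @ M @ v1), 12))               # ~0: v1 is the only 0-frequency

ones = np.ones(N)
print(float(ones @ M @ ones) > 0)                  # True: even a constant signal is penalized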

Figure 1: Image patch denoising using three different priors: SDGLR, SDGGLR, and the reviewer’s suggested prior $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$. We observe that the suggested prior is noticeably worse than the proposed SDGGLR.

We have conducted an experiment on image denoising to demonstrate that the reviewer’s suggested regularizer $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$ did indeed perform noticeably worse than our proposed SDGGLR, as shown in Fig. 1 here, agreeing with our understanding. We see that in the second image, the reviewer’s suggestion performed even worse than the previous SDGLR, with noticeable false contours due to its inability to promote even a simple constant signal.

Response 3b: Since the reviewer explicitly asked “what is a linear line”, a concept studied in elementary school, we provide a definition below for the sake of completeness. A one-dimensional function $f(x)$ is a linear line with respect to independent variable $x$ if $f(x)$ can be written as $f(x)=ax+b$, where $a,b\in\mathbb{R}$ are the slope and $y$-intercept of the line, respectively. We refer the reviewer to the Wiki page https://en.wikipedia.org/wiki/Linear_equation for details. We are puzzled by this comment as well, since the reviewer appears to understand this concept when discussing “a graph of larger number of nodes but the same linear signal” earlier.

Response 3c: We vehemently disagree with the reviewer’s claim that the latent coordinates “had nothing to do with the actual graph”. The latent coordinates of a graph are computed via our proposed fast graph embedding method with linear time $\mathcal{O}(N)$ complexity described in Section VI. Graph embedding—a well studied topic in the machine learning literature [LLE, LE] for over a decade—computes coordinates in a low-dimensional space, where Euclidean distances between computed coordinates correspond to node-to-node network distances in the original graph. Saying our computed latent coordinates “had nothing to do with the actual graph” shows a lack of basic understanding of what a graph embedding actually is. We humbly ask the reviewer to peruse overview papers in graph embedding such as [hamilton17, xu21] for reference.

Response 3d: In summary, in this response 3, the reviewer has i) insisted on an ill-advised regularizer $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$ that promotes the same dominant signal reconstruction as the conventional GLR $\mathbf{x}^{\top}\mathbf{L}_{n}\mathbf{x}$ and hence does not promote any new notion of “graph linearity” (and experimentally performed noticeably worse than our proposed GGLR); ii) questioned what a linear line is—a concept taught in elementary school; and iii) demonstrated a lack of basic understanding of what a graph embedding is. In totality, with due respect to the reviewer, this does not amount to a technically sound argument for why our formulation, described in Section IV for signals with coordinates and Section VI for signals without coordinates, is “ill-posed”. We humbly ask the reviewer to reread Sections IV and VI carefully for a basic understanding of our formulation.

Comment 4: In response 4, they mention ‘As discussed in the Introduction and shown in new Fig.1, many graph signals are slowly varying across the underlying kernels, and promoting PWC signal reconstruction is clearly not suitable. This motivates a generalization to promote PWP signal reconstruction instead ’ — what are these kernels? Alternatively, given the example in Figure (b) could you visualise the benefits of your approach over regular GLR? Given that the work is for general graph signals and it is a mostly empirical work, it is not sufficient to just lift the motivation from image setting and make it a motivation for graphs.

Response 4: In the revised manuscript, we have reconstructed Fig. 1 to show the denoising results of temperature signals on a China geographical graph (kNN graph for $k=5$). It consists of 3191 nodes, and each node represents a regional district in China. The data comes from Loess Plateau SubCenter, National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://loess.geodata.cn). Kindly note that this is not an image setting, and the graph is not a regular kernel such as a 2D grid. In Fig. 1 of the revised manuscript (Fig. 2 here), we can clearly observe the “staircase” effect—artificial signal discontinuities between adjacent reconstructed constant sub-domains—for SDGLR. In contrast, our proposed SDGGLR produced a much smoother signal reconstruction and higher PSNR.

Figure 2: Examples of staircase effects. We can see denoising results (PSNR) using SDGLR and SDGGLR on the corrupted temperature graph with noise variance 50. The staircase effect of SDGLR, as compared to SDGGLR, is visually obvious.

Comment 5: Response 5: ‘There exist many practical graph signals that are slowly varying across the underlying kernels with occasional discontinuities; Again, as I mention above, then why don’t the authors give us concrete examples of such graph signals instead of referring to TV and TGC that are used in a 2D setting which has also visual or perceptual motivation, whereas this is not the case in graphs? Why is stair-case effect relevant for graphs? Does it affect the compression of graph signals, for example?

Also, a question that I had in mind from the discussion on K-planar signals: Does the eigenvectors basis, K-of them in particular, form a suitable basis for the K-planar signal space? In that case, can you connect your work to a K-band-limited graph signal setting?

Further, given a general setting of a graph and graph signals, how do I judge if the latent coordinates obtained by your approach in Section VI is ‘good’? This is particularly for settings where the graph is not a 2D image…

Response 5a: For motivation, in Fig. 1 of the revised manuscript (Fig. 2 here) we have already included an example of a non-imaging signal that is slowly varying across the underlying graph kernel—a temperature signal on a China geographical graph. The staircase effect is an undesirable artifact that introduces artificial signal discontinuities across adjacent reconstructed constant sub-domains while the ground-truth signal is smooth—the problem is not unique to images. This is clearly demonstrated in Fig. 2(c) for the temperature signal in China, in Fig. 3(d) for a synthetic smooth signal on a manifold graph, and in Fig. 4(c) for a 3D point cloud, when conventional SDGLR was used. These artifacts would appear for the graph version of TV, called graph total variation (GTV), as well [elmoataz08, couprie13]. These observable staircase artifacts are not only visually unpleasant; they lead to erroneous signal reconstructions, resulting in lower PSNR (larger MSE) compared to our proposed SDGGLR. We are unsure of the relevance of staircase effects to signal compression, which is outside the scope of this paper focused on signal restoration.

Response 5b: The whole point of the proposed GGLR $\mathbf{x}^{\top}\mathcal{L}\mathbf{x}$ using the new GNG Laplacian matrix $\mathcal{L}$ is that any planar graph signal $\mathbf{x}$ with discrete samples residing on the same hyperplane evaluates to $\mathbf{x}^{\top}\mathcal{L}\mathbf{x}=0$; in other words, planar signals are the 0-frequencies of the PSD GNG Laplacian matrix $\mathcal{L}$. In terms of eigen-analysis, all planar signals reside in a $K$-dimensional eigen-subspace corresponding to the smallest eigenvalue $\lambda_{1}=0$ of $\mathcal{L}$ with multiplicity $K$. This is analogous to the constant signals that are the 0-frequencies of the PSD graph Laplacian matrix $\mathbf{L}$; here, the 1-dimensional eigen-subspace spanned by the constant vector $\mathbf{1}$ corresponds to the smallest eigenvalue $\lambda_{1}=0$ of $\mathbf{L}$ with multiplicity 1. In the revised manuscript, we have made this point explicit in the Introduction and in Theorem 1 to facilitate understanding.

Note that by formulating a MAP optimization problem of the following form (equation (30) in the revised manuscript) for signal reconstruction:

$$\min_{\mathbf{x}} \|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}_{2} + \mu\,\mathbf{x}^{\top}\mathcal{L}\mathbf{x} \quad (6)$$

we are only biasing the signal reconstruction towards low frequencies of $\mathcal{L}$, instead of enforcing a strict bandlimited signal reconstruction as done in bandlimited sampling, for example [anis16]. See [cheung18] for details of general MAP formulations and associated spectral interpretations for graph signal reconstruction.

Response 5c: The reviewer is essentially asking about the quality evaluation of a computed graph embedding. In the graph embedding literature [hamilton17, xu21], the quality of a computed graph embedding (coordinates for each graph node) is evaluated for specific graph applications. We have conducted extensive experiments to test our embedding method in our earlier conference version of this work [Chen2022DSLW], focusing exclusively on graph clustering. We compared it with representative state-of-the-art embedding methods: LLE [LLE], LE [LE], DeepWalk [DeepWalk], node2vec [node2vec], and NetWalk [NetWalk]. To evaluate clustering accuracy, we used four criteria: Rand Index (RI), Precision, Purity, and Normalized Mutual Information (NMI) [Alok2013DevelopmentOA]. Experiments show that our embedding is among the fastest in the literature, while producing the best clustering performance for manifold graphs. Due to page limitations, in the experimental section here we only compared our embedding method for GGLR with two representative embedding methods, LLE and LE. We observe that our embedding method achieves the best interpolation performance for manifold graphs.

Figure 3: We conducted an experiment on a synthetic manifold graph. We first uniformly sampled 3000 points from a continuous manifold surface, and then constructed a kNN manifold graph ($k=5$). The original noiseless graph is shown in (a). The zoomed-out version and its corrupted version are shown in (b) and (c), respectively. We can observe staircase effects in (d) due to SDGLR. The result by our proposed SDGGLR looks smoother in (e).
Figure 4: Denoising results using SDGLR and SDGGLR on the corrupted 3D point cloud MAN with noise variance 25. As we can see, the staircase effect in (c) is caused by SDGLR. The result by SDGGLR looks more natural in (d).

Comment 6: Response 7: the authors state that in ‘For graphs that do not naturally have latent coordinates as attributes, and they are not manifold graphs, then reasonable latent coordinates cannot be easily obtained using our proposed method’ it would be nice to clarify what reasonable is meant, and how the user can judge it for a new dataset - for example if all I have access to is graph and signals on graph, how do I judge if the manifold representation you present is ‘reasonable’ or effective? In the end, we are talking about a basis representation of graph signals, and then the eigenspace of the laplacian or GGLR or SDGLR forms the basis for expressing the signals (PWP, for example).

Response 6a: For the question of “when is a graph a manifold graph?”, we have already provided a criterion in Section VI of the paper, which we repeat here for completeness. Given a graph composed of nodes uniformly sampled from a smooth manifold, the betweenness centrality of the nodes should be similar, i.e., all nodes are equally likely to appear in a given shortest path. Thus, we first divide each $C_{B}(i)$ by the number of $(s,t)$ pairs, $(N-1)(N-2)$, and then employ the variance of betweenness centrality (VBC) as a metric to evaluate the quality of a manifold graph. Only qualified manifold graphs are input to our algorithm to compute embeddings. As shown in Table II, the first four graphs, with smaller VBCs, are considered qualified manifold graphs.
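For concreteness, a minimal sketch of the VBC computation described above (our own illustration using networkx; the graph constructions below are hypothetical examples, not the datasets in Table II):

import numpy as np
import networkx as nx

def variance_of_betweenness_centrality(G):
    """VBC: variance of betweenness centrality after dividing each C_B(i) by (N-1)(N-2)."""
    N = G.number_of_nodes()
    cb = nx.betweenness_centrality(G, normalized=False)            # raw C_B(i)
    cb_norm = np.array([cb[i] / ((N - 1) * (N - 2)) for i in G.nodes()])
    return float(np.var(cb_norm))

# A regular ring lattice (near-uniform centrality, small VBC) vs. a star graph (large VBC)
print(variance_of_betweenness_centrality(nx.watts_strogatz_graph(100, 4, 0.0, seed=0)))
print(variance_of_betweenness_centrality(nx.star_graph(99)))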

Response 6b: As previously discussed, the eigen-subspace corresponding to the smallest eigenvalue $\lambda_{1}=0$ of multiplicity $K$ is the set of planar graph signals. This is precisely why $\mathbf{x}^{\top}\mathcal{L}\mathbf{x}=0$ for all planar signals $\mathbf{x}$.

Comment 7: Detailed response 9: isn’t that also true for the regular graph Laplacian?

Response 7: First, it should be clear that this is not true for the reviewer’s proposed prior $(\mathbf{L}_{n}\mathbf{x})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{x})$, since $(\mathbf{L}_{n}\mathbf{1})^{\top}\mathbf{L}(\mathbf{L}_{n}\mathbf{1})\neq 0$, and thus the reviewer’s proposed prior is not invariant to constant shifts.

Second, it is indeed true also for the regular combinatorial graph Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{W}$, which is obvious since the first (unnormalized) eigenvector $\mathbf{v}_{1}$ corresponding to eigenvalue 0 is the constant all-one vector $\mathbf{v}_{1}=\mathbf{1}$, i.e., $\mathbf{L}\mathbf{1}=\mathbf{0}$. Thus, it is very clear that a constant-shifted vector $\mathbf{1}+c\mathbf{1}=(1+c)\mathbf{1}$ is still constant, and thus $\mathbf{L}(1+c)\mathbf{1}=(1+c)\mathbf{L}\mathbf{1}=\mathbf{0}$.
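A quick numerical check of the two statements above (our own sketch on an arbitrary small irregular graph): GLR $\mathbf{x}^{\top}\mathbf{L}\mathbf{x}$ is invariant to a constant shift $\mathbf{x}+c\mathbf{1}$ since $\mathbf{L}\mathbf{1}=\mathbf{0}$, while the reviewer’s suggested prior is not, since $\mathbf{L}_{n}\mathbf{1}\neq\mathbf{0}$ on an irregular graph.

import numpy as np

# Irregular connected graph: triangle 0-1-2 plus a pendant node 3 attached to node 2
N = 4
W = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
D = W.sum(axis=1)
L = np.diag(D) - W
Ln = np.diag(D ** -0.5) @ L @ np.diag(D ** -0.5)

rng = np.random.default_rng(2)
x = rng.standard_normal(N)
c = 3.7
glr  = lambda z: float(z @ L @ z)                    # GLR: x^T L x
sugg = lambda z: float((Ln @ z) @ L @ (Ln @ z))      # suggested prior: (L_n x)^T L (L_n x)

print(np.isclose(glr(x + c * np.ones(N)), glr(x)))   # True: shift-invariant since L 1 = 0
print(np.isclose(sugg(x + c * np.ones(N)), sugg(x))) # False: L_n 1 != 0 on an irregular graph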

For the case of the 1-dimensional GNG Laplacian $\mathcal{L}$, the dimension of the eigen-subspace for eigenvalue 0 is two, i.e., $\mathbf{v}_{1}\perp\mathbf{v}_{2}$ and $\mathcal{L}\mathbf{v}_{1}=\mathcal{L}\mathbf{v}_{2}=\mathbf{0}$. In this case, it is less obvious whether a constant shift of any vector $\mathbf{u}=a_{1}\mathbf{v}_{1}+a_{2}\mathbf{v}_{2}$ in this eigen-subspace automatically means $\mathcal{L}(\mathbf{u}+c\mathbf{1})=\mathbf{0}$. We provided a proof here.

Comment 8: Detailed response 14 and in Remark under Theorem 2: What is 1D GNG is not clearly defined (It appears first on page 8), is this the same as 1-dimensional PWP, with K=1? This was a bit unclear. Then is it the case that the signal at every node is of the form ($\mathbf{x}_{i}=a+b\mathbf{p}_{i}$)? Is this what is meant by a linear signal referred to in the response and Remark2? Please clarify.

Response 8: Due to the limited space available (only 16 pages are permitted for a revised TSP manuscript, including references), we have removed the discussion on 1D GNG Laplacian in the revised manuscript.

Comment 9: Detailed response 18: Agreed, but the true value of optimal tradeoff is itself a function of the true $\mathbf{x}^{o}$ as can be seen from equation (33) - this seems a bit circular to me in that case.

Also, doesn’t modelling $\lambda_{i}$ as an exponential function implicitly mean that it is a low-pass graph signal? Then isn’t the following analysis restricted to that particular case of graph signals? On the other hand, the house signal does indeed have edges, and hence is not a smooth GS - I am a bit confused here. I request the authors to clarify this part.

Response 9a: First, computing the optimal tradeoff between the fidelity term and a regularizer term in general is a well-studied topic in machine learning and signal processing—called the bias-variance tradeoff [Belkin2018ReconcilingMM]. Specifically for graph signal denoising using GLR for regularization, it was investigated previously in [chen2017bias]. We are simply extending this analysis to the case when the regularizer is GGLR instead of GLR. Specifically, we only need to compute $\rho_{N}\approx\lambda_{N}\mathbf{v}_{N}^{\top}\mathbf{y}$ in equation (34), without knowledge of the ground-truth signal $\mathbf{x}^{o}$. This is clarified in the revised manuscript.

Response 9b: Modeling the $\lambda_{i}$’s as an exponential function assumes only the distribution of eigenvalues (graph frequencies) of the underlying graph, NOT the graph signal. Specifically, it does not imply that a signal residing on the graph kernel is either low-pass or high-pass.

Comment 10: Response 21: Thanks for adding the section on Manifold graph. However, it still remains vague and unclear. For example, what does roughly uniformly distributed samples on a K-dimensional continuous manifold mean? What is the distribution assumption? What is roughly? Is the definition your own or has it been used earlier, in which case kindly cite some of the relevant literature to help the reader. Also, the two properties that you assume that manifold graph: are they themselves the definition of the manifold graph you desire?

Response 10: A graph is naturally a discrete object, and connecting it to a continuous manifold via the “manifold assumption” is shockingly common in the machine learning literature [coifman06, hein06]. For example, assuming that graph nodes are projected samples chosen randomly from a uniform distribution over a manifold domain, [coifman06] showed that the graph Laplacian matrix converges to the continuous Laplace-Beltrami operator at an estimated rate as the number of points approaches infinity and the distances between points go to zero. Under the same uniform sampling assumption, [hein06] showed that a graph Laplacian regularization term converges to a continuous penalty functional. Our notion of a manifold graph is no different from these earlier works; our innovation comes from efficiently computing a graph embedding given only a manifold graph, and using the computed latent coordinates to promote planar / PWP signal reconstruction via the proposed GNG Laplacian matrix $\mathcal{L}$.

Further, the two assumptions we made for a manifold graph are not unusual in the graph embedding literature, and similar assumptions have been made previously for kNN graphs in ordinal embedding [Terada2014LocalOE, Cucuringu2015OrdinalEO].

Comment 11: Detailed response 27, thank you for pointing out the exceedingly common nature of the objective. My comment was merely to point out that people, such as myself, who do not solely work on image quality assessment do not find SSIM as a staple metric to understand it when just simply quoted in abbreviation the passing. Same is the case with PSNR.

Response 11: We thank the reviewer for this comment. PSNR and SSIM have indeed been the most common image quality metrics in the image processing literature for well over two decades, and we have made an effort to explain this basic fact in our revised TSP manuscript.

Comment 12: Detailed Response 30: I agree, what indeed is the central thesis of the paper has been fairly unclear. Is it planar graph signals or manifold graphs? or is it the fact that you artificially introduce the need for coordinates which essentially restricts your analysis to a handful of realistic cases, and in the other large set of cases where such an embedding is not available, you in turn estimate them from the given graph?

Response 12:

After two review rounds, at the risk of appearing redundant, we repeat the main contributions of the paper from the Introduction of the paper for the reviewer’s perusal:

1. For graph nodes endowed with latent space coordinates, we define the higher-order Gradient Graph Laplacian Regularizer (GGLR) for graph signal restoration, which retains its quadratic form and thus ease of computation under specified conditions (Theorem 3).

2. We prove GGLR’s promotion of planar signal reconstruction (Theorem 1)—planar signals are 0-frequencies of the PSD GNG Laplacian $\mathcal{L}$. We demonstrate that its signal-dependent variant SDGGLR promotes PWP signal reconstruction.

3. For manifold graphs without coordinates, we propose a fast parameter-free graph embedding method (Algorithm 1) to first obtain latent coordinates for graph nodes via fast eigenvector computation before employing SDGGLR.

4. We derive the MSE-minimizing weight parameter $\mu$ for GGLR in a MAP formulation, trading off bias and variance of the signal estimate (Theorem 4).

5. We demonstrate the efficacy of GGLR in four real-world graph signal restoration applications, outperforming previous graph smoothness priors.

We repeat again that the computed node coordinates form a graph embedding, whose pairwise Euclidean distances reflect the corresponding node-to-node network distances in the original graph. Blindly calling our graph embedding (or any graph embedding) “artificially introduce[d]”, when there is a whole graph embedding topic in the machine learning literature [hamilton17, xu21], seems very misinformed. This is particularly the case given that there exists no linear graph signal prior in the literature, and the reviewer has failed to provide any obvious alternative after two review rounds.

Studying graphs in a restricted setting is wholly consistent with the graph signal processing (GSP) literature; it is common to study graphs with only positive edges [pang17], with or without self-loops [anis16], balanced signed graphs [yang21], directed acyclic graphs (DAG) [Dinesh2023ModelingVI], etc. In this paper, we propose a new signal prior, GGLR, exclusively for manifold graphs, where the manifold assumption on graphs is prevalent in the GSP literature [Hurtado2022StudyOM, Yang2021GroupWiseHI], with broad applications as shown in the results section.

Comment 13: Detailed response 32: The suggestion was owing to the excellent match of the topic to the specific journal that specializes in the topic of graph signals and is more targeted and of to the audience of interest. Is it not the reviewers right to raise a suggestion?

Response 13:

In the published SPS reviewer guidelines (https://signalprocessingsociety.org/publications-resources/guidelines-reviewers), it is stated that

There are two criteria necessary for a recommendation of acceptance for publication: NOVELTY (new or innovative methods or approaches to a problem of engineering, science, or mathematics) and APPROPRIATENESS (a complete, well-written manuscript that falls within the scope of the transactions to which it was submitted).

It is clear from the second criterion that a reviewer’s task is to determine whether a paper is within scope for a journal or conference. Given that a paper is within scope, it cannot be rejected simply because a reviewer personally deems it “more suitable” for another journal or conference. For example, should a paper on image compression submitted to the IEEE International Conference on Image Processing (ICIP), which is clearly within scope of the conference, be rejected solely because a reviewer subjectively deems the paper “more suitable” for a more narrowly scoped conference such as the Picture Coding Symposium (PCS)? Or should a paper on speech recognition submitted to the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), which is within scope, be rejected because the reviewer thinks the paper is “more suitable” for the more narrowly scoped conference INTERSPEECH? Clearly, this is not a valid reason to reject a paper, nor is it the intent of the conference organizers.

Thus, following the published guidelines, we humbly ask the reviewer to comment only on the technical merits of our submitted manuscript and whether our paper is within TSP scope.

Comment 14: Typo: VBC or this graph on page 15 — should be VBC for this graph?

Also, why is PSNR a meaningful metric for the FIFA and Age estimation datasets over relative mean square error, given that it is not a visual dataset/image dataset?

Overall, the manuscript reads much better now. However, as pointed above, there needs to be more clarity on certain key aspects, particularly in terms of definition and motivation.

Response 14: The typo “VBC or this graph” is fixed. Thank you for pointing it out.

We first note that peak signal-to-noise ratio (PSNR) is a monotonically decreasing function of mean square error (MSE) (https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio):

$$\text{PSNR} = 10\log_{10}\left(\frac{\text{MAX}_{I}^{2}}{\text{MSE}}\right) \quad (7)$$

where $\text{MAX}_{I}$ is the maximum value a sample can take on. Thus, a higher PSNR necessarily means a smaller MSE, and we do not see a meaningful difference in presenting PSNR instead of MSE.
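A minimal sketch of the PSNR computation (our own illustration; for the non-image datasets, the maximum signal value is assumed to play the role of $\text{MAX}_{I}$, and the sample values below are made up):

import numpy as np

def psnr(x_true, x_hat, max_val):
    """PSNR = 10 * log10(MAX_I^2 / MSE); higher PSNR <=> lower MSE."""
    mse = float(np.mean((np.asarray(x_true) - np.asarray(x_hat)) ** 2))
    return 10.0 * np.log10(max_val ** 2 / mse)

x_true = np.array([10.0, 12.0, 11.0, 9.0])
x_hat  = np.array([10.5, 11.5, 11.2, 9.3])
print(psnr(x_true, x_hat, max_val=float(x_true.max())))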

For thoroughness, for the FIFA and age estimation datasets, beyond PSNR we also added another metric, accuracy (Acc), for comparison: accuracy is the number of correct predictions divided by the total number of predictions. Quantitative results in PSNR and accuracy are shown in Table VI of the revised manuscript.