
Generalized Laplacian Regularized Framelet Graph Neural Networks

Zhiqi Shao¹, Andi Han¹, Dai Shi², Andrey Vasnev¹ and Junbin Gao¹
¹University of Sydney ([email protected], [email protected], [email protected], [email protected])
²Western Sydney University ([email protected])
Zhiqi Shao, Andi Han and Dai Shi contributed equally.
Abstract

Graph neural networks (GNNs) have achieved remarkable results on various graph learning tasks. However, one recent challenge for GNNs is to adapt to different types of graph inputs, such as heterophilic graph datasets in which linked nodes are more likely to have different class labels and features. Accordingly, an ideal GNN model should adaptively accommodate all types of graph datasets with different labeling distributions. In this paper, we tackle this challenge by proposing a regularization framework on graph framelets with the regularizer induced from the graph p-Laplacian. By adjusting the value of p, the p-Laplacian based regularizer restricts the solution space of the graph framelet to the desirable region determined by the graph's homophilic features. We propose an algorithm that effectively solves a more general regularization problem and prove that the algorithm imposes a (p-Laplacian based) spectral convolution and a diagonal scaling operation on the framelet-filtered node features. Furthermore, we analyze the denoising power of the proposed model and compare it with the predefined framelet denoising regularizer. Finally, we conduct empirical studies to show the prediction power of the proposed model on both homophilic undirected and heterophilic directed graphs, with and without noise. Our proposed model shows significant improvements over multiple baselines, which suggests the effectiveness of combining graph framelets and the p-Laplacian.

Highlight

  • We define two types of new GNN layers by introducing p-Laplacian regularizers to both decomposed and reconstructed framelets.

  • We provide an iterative algorithm to fit the proposed models and explore an alternative algorithm developed from the F-norm Locality Preserving Projection (LPP).

  • We prove that our algorithm provides a sequence of spectral graph convolutions and diagonal scalings over framelet-filtered graph signals, which further explains the interaction between the p-Laplacian regularizer and the framelet.

  • We link the proposed p-Laplacian regularizer with previous work on framelet regularizers to illustrate the denoising power of our model.

  • We test the model on homophilic (undirected) and heterophilic (directed) graphs. To the best of our knowledge, we are the first to explore the possibility of applying the framelet GCN to directed graphs.

1 Introduction

Graph neural networks (GNNs) have demonstrated remarkable ability on graph learning tasks (Bronstein et al, 2017; Wu et al, 2020; Zhang et al, 2022; Zhou et al, 2020). The input to GNNs is so-called graph data, which records useful features and structural information among data points. Such data are widely seen in many fields, such as biomedical science (Ahmedt-Aristizabal et al, 2021), social networks (Fan et al, 2019), and recommender systems (Wu et al, 2022). GNN models can be broadly categorized into spectral and spatial methods. Spatial methods such as MPNN (Gilmer et al, 2017), GAT (Veličković et al, 2018) and GIN (Xu et al, 2018a) utilize the message passing mechanism to propagate node feature information based on node neighbourhoods (Scarselli et al, 2009). On the other hand, spectral methods, including ChebyNet (Defferrard et al, 2016), GCN (Kipf and Welling, 2017) and BernNet (He et al, 2021), are derived from classic convolutional networks: they treat the input graph data as signals (i.e., functions defined on the graph nodes) (Ortega et al, 2018) and filter the signals in the Fourier domain (Bruna et al, 2014; Defferrard et al, 2016). Among various spectral-based GNNs, the graph framelet GCN (Zheng et al, 2022), built on a type of wavelet frame analogous to the same concept defined on manifolds (Dong, 2017), further separates the input signal via predefined low-pass and high-pass filtering functions, resulting in a convolution-type model as in the studies (Chen et al, 2022a; Zheng et al, 2021b, 2022). The graph framelet shows great flexibility in controlling both low- and high-frequency information, is robust to noise, and thus in general possesses high prediction power on multiple graph learning tasks (Chen et al, 2022a; Yang et al, 2022; Zheng et al, 2021b; Zhou et al, 2021a, b; Zou et al, 2022).

Along the path of developing more advanced models, one of the major challenges in GNNs arises from data consistency. For example, many GNN models (Xu et al, 2018a; Wu et al, 2020; Zhu et al, 2020b; Wu et al, 2021; Liu et al, 2022) are designed under the homophily assumption, i.e., nodes with similar features or labels are often linked with each other. Such a phenomenon is commonly observed in citation networks (Ciotti et al, 2016), which have been widely used as benchmarks in GNN empirical studies. However, the homophily assumption does not always hold, and its opposite, heterophily, is observed quite often in many real-world datasets in which linked nodes are more likely to have different class labels and features (Zheng et al, 2021a). For example, in online transaction networks, fraudsters are more likely to link to customers than to other fraudsters (Pandit et al, 2007). GNNs designed under the homophily assumption are unsuitable for heterophilic graphs, as is evident from their significant performance degradation (Zheng et al, 2021a). The reason is that heterophilic graphs contain heterogeneous instances, and hence the signals should be sharpened rather than smoothed out. An ideal framework for learning on graphs should be able to accommodate both homophilic and heterophilic scenarios.

One active line of work for resolving this adaptation challenge is to regularize the solution of GNNs from an optimization perspective. The work by Zhu et al (2021) unified most state-of-the-art GNNs in an optimization framework. Furthermore, a recent work (Fu et al, 2022) considered assigning an adjustable p-Laplacian regularizer to the (discrete) graph regularization problem that is conventionally treated as a way of producing GNN outcomes (i.e., Laplacian smoothing). Given that the classic graph Laplacian regularizer measures the graph signal energy along edges under the $L_2$ metric, it would be beneficial if the GNN could be adapted to heterophilic graphs under an $L_p$ metric ($1\leq p<2$): since the $L_1$ metric is more robust to high-frequency signals, higher model discriminative power is preserved. The early work (Hu et al, 2018) demonstrated the advantage of adopting the $L_1$ metric in the Locality Preserving Projection (LPP) model. In addition, the recently proposed p-Laplacian GNN (Fu et al, 2022) adaptively modifies aggregation weights by exploiting the variation of node embeddings on edges measured by the graph gradient. Investigating the application of the p-Laplacian further, Luo et al (2010) suggested an efficient gradient descent optimization strategy to construct the p-Laplacian embedding space that guarantees convergence to viable local solutions. Although the p-Laplacian based optimization scheme has been successfully lodged in a basic GNN model (Fu et al, 2022), it remains unclear whether it can be incorporated into more advanced GNN models (i.e., graph framelets) with higher prediction power and flexibility. Specifically, one may be interested in identifying whether the advantage of deploying the p-Laplacian in GNNs remains in graph framelets, and what the exact functionality of the p-Laplacian is when interacting with the feature representations generated in both the low- and high-pass framelet domains. In addition, as graph framelets have shown great potential in graph noise reduction (Zheng et al, 2022; Yang et al, 2022; Zou et al, 2022), it is also unclear whether the inclusion of a p-Laplacian regularizer enhances or dilutes such power. These research gaps inspire us to incorporate the p-Laplacian into graph framelets and further explore the underlying relationship between them.

Since the p-Laplacian regularized optimization problem lacks a closed-form solution except when $p=2$ (Zhou and Schölkopf, 2005), an iterative algorithm is suggested to estimate the solution, and each iteration can be viewed as a GNN layer (Zhou et al, 2021a). The solution of such an algorithm is defined by the first-order condition of the optimization problem. As a result, one can relate this to the implicit layer approach (Chen et al, 2022b; Zhang et al, 2003), which has the potential to avoid over-smoothing since the adjusted node features are re-injected at each iteration step. By adjusting the value of $p$, both the p-Laplacian GNN and the algorithm are capable of producing learning outcomes with different discriminative levels and are thus able to handle both homophilic and heterophilic graphs. Similar work has been done by Frecon et al (2022), which presents a framework based on Bregman layers to fulfill bi-level optimization formulations. To reap the benefit of p-Laplacian regularization in the framelet domain, in this paper we propose two p-Laplacian regularized framelet GCN models, named the p-Laplacian Undecimated Framelet GCN (pL-UFG) and the p-Laplacian Fourier Undecimated Framelet GCN (pL-fUFG). In these models, the p-Laplacian regularization can be applied either in the framelet domain or in the well-known framelet Fourier domain. The p-Laplacian based regularizer incurs a penalty on both the low- and high-frequency domains created by the framelet, producing flexible models that are capable of adapting to different types of graph inputs (i.e., homophilic and heterophilic, directed and undirected). We summarize our contributions as follows:

  • We define two types of new GNN layers by introducing p-Laplacian regularizers to both decomposed and reconstructed framelets. This paves the way for introducing more general Bregman divergence regularization into the graph framelet framework;

  • We propose an iterative algorithm to fit the proposed models and explore an alternative algorithm developed from the F-norm LPP;

  • We prove that the iteration of the proposed algorithm provides a sequence of spectral graph convolutions and diagonal scalings over framelet-filtered graph signals, which gives a deeper explanation of how the p-Laplacian regularizer interacts with the framelet.

  • We connect the proposed p-Laplacian regularizer to the previously studied framelet regularizer to illustrate the denoising power of the proposed model.

  • We investigate the performance of the new models on graph learning tasks for both homophilic (undirected) graphs and heterophilic (directed) graphs. To the best of our knowledge, we are the first to explore the possibility of applying framelet GCNs to directed graphs. The experimental results demonstrate the effectiveness of pL-UFG and pL-fUFG on real-world node classification tasks with strong robustness.

The rest of the paper is organized as follows. In Section 2, we introduce the p-Laplacian operator and regularized GNNs, followed by a review of recent studies on regularized graph neural networks, including implicit layers and graph homophily. In Section 3, we introduce the fundamental properties of graph framelets and propose the p-Laplacian regularized framelet models. Furthermore, we develop an algorithm to solve a regularization problem that is more general than p-Laplacian regularization, and we provide theoretical analysis to show how the p-Laplacian based regularizer interacts with the graph framelet. Section 4 discusses the denoising power and computational complexity of the proposed model, compares the p-Laplacian regularizer with other regularizers, and outlines limitations and future directions. We present the experimental results and analysis in Section 5. Finally, the paper is concluded in Section 6.

2 Preliminaries

This section provides an in-depth exploration of the fundamental concepts, encompassing graphs, graph framelets, and regularized graph neural networks. To aid the reader in following the intended ideas, each newly introduced model and definition is accompanied by a succinct review of its development.

2.1 Basic Notations

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},W)$ denote a weighted graph, where $\mathcal{V}=\{v_{1},v_{2},\dots,v_{N}\}$ and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represent the node set and the edge set, respectively. $\mathbf{X}\in\mathbb{R}^{N\times c}$ is the feature matrix for $\mathcal{G}$ with $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$ as its rows, and $W=[w_{i,j}]\in\mathbb{R}^{N\times N}$ is the weight matrix on edges with $w_{i,j}>0$ if $(v_{i},v_{j})\in\mathcal{E}$ and $w_{i,j}=0$ otherwise. For undirected graphs, $w_{i,j}=w_{j,i}$, i.e., $W$ is a symmetric matrix. For directed graphs, it may happen that $w_{i,j}\neq w_{j,i}$, i.e., $W$ need not be symmetric. In most cases, the weight matrix is the graph adjacency matrix, i.e., $w_{i,j}\in\{0,1\}$ with $w_{i,j}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ and $w_{i,j}=0$ otherwise. Furthermore, let $\mathcal{N}_{i}=\{v_{j}:(v_{i},v_{j})\in\mathcal{E}\}$ denote the set of neighbours of node $v_{i}$ and $\mathbf{D}=\text{diag}(d_{1,1},\dots,d_{N,N})\in\mathbb{R}^{N\times N}$ denote the diagonal degree matrix with $d_{i,i}=\sum^{N}_{j=1}w_{i,j}$ for $i=1,\dots,N$. The normalized graph Laplacian is defined as $\widetilde{\mathbf{L}}=\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}(\mathbf{W}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$. Lastly, for any vector $\mathbf{x}=(x_{1},\dots,x_{c})\in\mathbb{R}^{c}$, we use $\|\mathbf{x}\|_{2}=(\sum^{c}_{i=1}x^{2}_{i})^{\frac{1}{2}}$ to denote its $L_2$-norm, and similarly for any matrix $\mathbf{M}=[m_{i,j}]$, $\|\mathbf{M}\|:=\|\mathbf{M}\|_{F}=(\sum_{i,j}m^{2}_{i,j})^{\frac{1}{2}}$ denotes its Frobenius norm.

2.2 Consistency in Graphs

Generally speaking, most GNN frameworks are designed under the homophily assumption, in which the labels of nodes and their neighbours in the graph are mostly identical. The recent work by Zhu et al (2020a) emphasises that general topology GNNs fail to obtain outstanding results on graphs whose connected nodes have different class labels and dissimilar features, which we call heterophilic or low-homophily graphs. The definitions of homophilic and heterophilic graphs are given as follows:

Definition 2.1 (Homophily and Heterophily (Fu et al, 2022)).

The homophily or heterophily of a network describes the relationship between labels of connected nodes. The level of homophily of a graph can be measured by $\mathcal{H}(\mathcal{G})=\mathbb{E}_{i\in\mathcal{V}}\left[|\{j: j\in\mathcal{N}_{i}, y_{j}=y_{i}\}|/|\mathcal{N}_{i}|\right]$, where $|\{j: j\in\mathcal{N}_{i}, y_{j}=y_{i}\}|$ denotes the number of neighbours of node $i\in\mathcal{V}$ that share the same label as $i$, i.e., $y_{i}=y_{j}$. $\mathcal{H}(\mathcal{G})\rightarrow 1$ corresponds to strong homophily, while $\mathcal{H}(\mathcal{G})\rightarrow 0$ indicates strong heterophily. We say that a graph is homophilic (heterophilic) if it has strong homophily (heterophily).
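To make the measure concrete, the following is a minimal sketch that computes $\mathcal{H}(\mathcal{G})$ from an edge list and node labels; the function name and the NumPy-based representation are illustrative choices rather than part of the original formulation.

```python
import numpy as np

def homophily_level(edges, labels):
    """Estimate H(G) = E_i[ |{j in N_i : y_j = y_i}| / |N_i| ].

    edges:  iterable of (i, j) pairs; for undirected graphs include both
            directions or symmetrize beforehand.
    labels: 1-D array of node labels, labels[i] = y_i.
    """
    labels = np.asarray(labels)
    n = len(labels)
    same = np.zeros(n)   # neighbours sharing the node's label
    deg = np.zeros(n)    # |N_i|
    for i, j in edges:
        deg[i] += 1
        same[i] += (labels[i] == labels[j])
    mask = deg > 0       # ignore isolated nodes
    return float(np.mean(same[mask] / deg[mask]))

# Toy example: a 4-cycle with labels [0, 0, 1, 1] has H(G) = 0.5
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 0), (0, 3)]
print(homophily_level(edges, [0, 0, 1, 1]))  # 0.5
```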

2.3 p-Laplacian Operator

The notion of the graph p-Laplacian, and its use in a regularization problem on graph-structured data, can be traced back to Zhou and Schölkopf (2005); this regularization scheme was further explored to build a flexible GNN learning model in Fu et al (2022). In this paper, we adopt notation similar to that used in Fu et al (2022) to define the graph p-Laplacian operator. We first define the graph gradient as follows:

Definition 2.2 (Graph Gradient).

Let $\mathcal{F}_{\mathcal{V}}:=\{\mathbf{F}\,|\,\mathbf{F}:\mathcal{V}\rightarrow\mathbb{R}^{d}\}$ and $\mathcal{F}_{\mathcal{E}}:=\{\mathbf{g}\,|\,\mathbf{g}:\mathcal{E}\rightarrow\mathbb{R}^{d}\}$ be the function spaces on nodes and edges, respectively. Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},W)$ and a function $\mathbf{F}\in\mathcal{F}_{\mathcal{V}}$, the graph gradient is an operator $\nabla_{W}:\mathcal{F}_{\mathcal{V}}\rightarrow\mathcal{F}_{\mathcal{E}}$ defined, for all $[i,j]\in\mathcal{E}$, by

$(\nabla_{W}\mathbf{F})([i,j]):=\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i},$

where $\mathbf{f}_{i}$ and $\mathbf{f}_{j}$ are the signal vectors on nodes $i$ and $j$, i.e., the rows of $\mathbf{F}$.

Without confusion, we simply write $\nabla$ for $\nabla_{W}$. For $[i,j]\notin\mathcal{E}$, $(\nabla\mathbf{F})([i,j]):=0$. The graph gradient of $\mathbf{F}$ at a vertex $i$, for all $i\in\{1,\dots,N\}$, is defined as $\nabla\mathbf{F}(i):=((\nabla\mathbf{F})([i,1]);\dots;(\nabla\mathbf{F})([i,N]))$, and its Frobenius norm is given by $\|\nabla\mathbf{F}(i)\|_{2}:=(\sum^{N}_{j=1}(\nabla\mathbf{F})^{2}([i,j]))^{\frac{1}{2}}$, which measures the variation of $\mathbf{F}$ around node $i$. Note that the notation $\nabla\mathbf{F}$ has two readings: as a graph gradient (over edges) and as a node gradient (over nodes); the intended meaning can be inferred from context in the rest of the paper. We also provide the definition of graph divergence, which is analogous to the continuous setting, where the Laplacian of a function is the divergence of its gradient.
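Before turning to the graph divergence, the following sketch computes the edge-wise graph gradient of Definition 2.2 and the node gradient p-norm used later in the p-Laplacian energy; the dense NumPy arrays and the small degree guard are illustrative simplifications.

```python
import numpy as np

def graph_gradient(W, F):
    """Edge-wise graph gradient from Definition 2.2.

    W: (N, N) weight matrix, F: (N, d) node signals.
    Returns G with G[i, j] = sqrt(w_ij/d_jj) * f_j - sqrt(w_ij/d_ii) * f_i,
    shape (N, N, d); entries for non-edges are zero because w_ij = 0.
    """
    d = W.sum(axis=1)                       # node degrees d_ii
    d = np.where(d > 0, d, 1.0)             # guard against isolated nodes
    S = np.sqrt(W)                          # sqrt(w_ij), elementwise
    G = (S / np.sqrt(d)[None, :])[:, :, None] * F[None, :, :] \
      - (S / np.sqrt(d)[:, None])[:, :, None] * F[:, None, :]
    return G

def node_gradient_norm(G, p=2):
    """||nabla F(v_i)||_p over the neighbourhood of each node."""
    edge_norm = np.linalg.norm(G, axis=2)   # ||nabla F([i, j])|| per edge slot
    return (edge_norm ** p).sum(axis=1) ** (1.0 / p)
```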

Definition 2.3 (Graph Divergence).

Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},W)$ and functions $\mathbf{F}:\mathcal{V}\rightarrow\mathbb{R}^{d}$ and $\mathbf{g}:\mathcal{E}\rightarrow\mathbb{R}^{d}$, the graph divergence is an operator $\text{div}:\mathcal{F}_{\mathcal{E}}\rightarrow\mathcal{F}_{\mathcal{V}}$ which satisfies:

$\langle\nabla\mathbf{F},\mathbf{g}\rangle=\langle\mathbf{F},-\text{div}(\mathbf{g})\rangle.$ (1)

Furthermore, the graph divergence can be computed by:

$\text{div}(\mathbf{g})(i)=\sum_{j=1}^{N}\sqrt{\frac{w_{i,j}}{d_{i,i}}}(\mathbf{g}[i,j]-\mathbf{g}[j,i]).$ (2)

Given the above definitions of graph gradient and divergence, we arrive at the definition of the graph p-Laplacian.

Definition 2.4 (p-Laplacian operator (Fu et al, 2022)).

Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},W)$ and a multi-channel signal function $\mathbf{F}:\mathcal{V}\rightarrow\mathbb{R}^{d}$, the graph p-Laplacian is an operator $\Delta_{p}:\mathcal{F}_{\mathcal{V}}\rightarrow\mathcal{F}_{\mathcal{V}}$ defined by:

$\Delta_{p}\mathbf{F}:=-\frac{1}{2}\text{div}(\|\nabla\mathbf{F}\|^{p-2}\nabla\mathbf{F}),\quad\text{for}\,\,\,p\geq 1,$ (3)

where $\|\cdot\|^{p-2}$ denotes the element-wise power over the node gradient $\nabla\mathbf{F}$.

Remark 1.

Clearly, when $p=2$, Eq. (3) recovers the classic graph Laplacian. When $p=1$, Eq. (3) can be viewed as a curvature operator defined on the nodes of the graph, because in that case $\Delta_{1}\mathbf{F}=-\frac{1}{2}\text{div}\left(\frac{\nabla\mathbf{F}}{\|\nabla\mathbf{F}\|}\right)$, which aligns with the mean curvature operator defined on a continuous domain. Furthermore, the p-Laplacian operator is linear when $p=2$, while in general for $p\neq 2$ it is a non-linear operator since $\Delta_{p}(a\mathbf{F})\neq a\Delta_{p}(\mathbf{F})$ for $a\in\mathbb{R}\backslash\{1\}$.
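As a concrete illustration of Definition 2.4, the sketch below applies the p-Laplacian to a multi-channel signal by composing the graph gradient and divergence of Definitions 2.2-2.3; the small $\varepsilon$ guard against vanishing gradients and the dense-matrix representation are assumptions made only for this example.

```python
import numpy as np

def p_laplacian_apply(W, F, p=2.0, eps=1e-8):
    """Apply Delta_p F = -0.5 * div( ||grad F||^{p-2} grad F ) to F.

    W: (N, N) weight matrix, F: (N, d) node signals.
    """
    d = np.maximum(W.sum(axis=1), eps)
    S = np.sqrt(W)
    # edge-wise gradient: grad[i, j] = sqrt(w_ij/d_jj) f_j - sqrt(w_ij/d_ii) f_i
    grad = (S / np.sqrt(d)[None, :])[:, :, None] * F[None, :, :] \
         - (S / np.sqrt(d)[:, None])[:, :, None] * F[:, None, :]
    # edge-wise norm raised to p - 2, applied element-wise to the gradient
    scale = (np.linalg.norm(grad, axis=2) + eps) ** (p - 2)
    g = scale[:, :, None] * grad
    # divergence: div(g)(i) = sum_j sqrt(w_ij/d_ii) (g[i, j] - g[j, i])
    coeff = S / np.sqrt(d)[:, None]
    div = (coeff[:, :, None] * (g - g.transpose(1, 0, 2))).sum(axis=1)
    return -0.5 * div

# For p = 2 and symmetric W this reduces to (I - D^{-1/2} W D^{-1/2}) F,
# up to the small eps guard; for p < 2, edges with large variation are
# down-weighted, sharpening rather than smoothing the signal.
```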

Definition 2.5 (p-eigenvector and p-eigenvalue).

Let $\psi_{p}(\mathbf{v})=(\|v_{1}\|^{p-2}v_{1},\dots,\|v_{N}\|^{p-2}v_{N})$ for $\mathbf{v}\in\mathbb{R}^{N}$. With some abuse of notation, we call $\mathbf{u}\in\mathcal{F}_{\mathcal{V}}$ (with $d=1$) a p-eigenvector of $\Delta_{p}$ if $\Delta_{p}\mathbf{u}=\lambda\psi_{p}(\mathbf{u})$, where $\lambda$ is the associated p-eigenvalue.

Note that in the above definition we use the fact that $\mathbf{u}$ acting on all nodes gives a vector in $\mathbb{R}^{N}$. Let $\psi_{p}(\mathbf{U})=(\psi_{p}(\mathbf{U}_{:,1}),\dots,\psi_{p}(\mathbf{U}_{:,N}))$ for $\mathbf{U}\in\mathbb{R}^{N\times N}$. It can be shown (Fu et al, 2022) that the p-Laplacian can be decomposed as $\Delta_{p}=\psi_{p}(\mathbf{U})\boldsymbol{\Lambda}\psi_{p}(\mathbf{U})^{T}$ for some diagonal matrix $\boldsymbol{\Lambda}$.

2.4 Graph Framelets

Framelets are a type of wavelet frame. The study by Sweldens (1998) was the first to present a wavelet with a lifting scheme, which provided a foundation for research on wavelet transforms on graphs. With the increase of computational power, Hammond et al (2011) proposed a framework for the wavelet transformation on graphs and employed Chebyshev polynomials to approximate the wavelets. Dong (2017) further developed tight framelets on graphs, and this design has been applied to graph learning tasks (Zheng et al, 2021b) with great performance enhancement over classic GNNs. Recent studies show that framelets can naturally decompose the graph signal and re-aggregate it effectively, achieving significant results in graph noise reduction (Zhou et al, 2021a) with a double-term regularizer on the framelet coefficients. Combining with singular value decomposition (SVD), framelets have also been made applicable to directed graphs (Zou et al, 2022).

A simple method for building more versatile and stable framelet families, known as Quasi-Framelets, was suggested by Yang et al (2022). In this study, we introduce graph framelets using the same architecture described in Yang et al (2022). We begin by defining the filtering functions for Quasi-Framelets:

Definition 2.6 (Filtering functions for Quasi-Framelets).

We call a set of $R+1$ positive filtering functions defined on $[0,\pi]$, $\mathcal{F}=\{g_{0}(\xi),g_{1}(\xi),\dots,g_{R}(\xi)\}$, Quasi-Framelet scaling functions if the following identity condition is satisfied:

$g_{0}(\xi)^{2}+g_{1}(\xi)^{2}+\cdots+g_{R}(\xi)^{2}\equiv 1,\quad\forall\xi\in[0,\pi],$ (4)

such that $g_{0}$ descends from 1 to 0 and $g_{R}$ ascends from 0 to 1 as the frequency increases over the spectral domain $[0,\pi]$.

In particular, $g_{0}$ aims to regulate the highest frequencies and $g_{R}$ the lowest frequencies, while the remaining functions regulate the frequencies in between.

Consider a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with normalized graph Laplacian $\widetilde{\mathbf{L}}$. Let $\widetilde{\mathbf{L}}$ have the eigen-decomposition $\widetilde{\mathbf{L}}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$, where $\mathbf{U}$ is the matrix of orthogonal spectral bases and $\boldsymbol{\Lambda}=\text{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{N})$ contains the spectrum in increasing order. For a given set of Quasi-Framelet functions $\mathcal{F}=\{g_{0}(\xi),g_{1}(\xi),\dots,g_{R}(\xi)\}$ defined on $[0,\pi]$ and a given level $J$ ($\geq 0$), define the following Quasi-Framelet signal transformation matrices

$\mathcal{W}_{0,J}=\mathbf{U}g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m+J}})\cdots g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{T},$ (5)
$\mathcal{W}_{r,0}=\mathbf{U}g_{r}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{T},\quad\text{for }r=1,\dots,R,$ (6)
$\mathcal{W}_{r,\ell}=\mathbf{U}g_{r}(\frac{\boldsymbol{\Lambda}}{2^{m+\ell}})g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m+\ell-1}})\cdots g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{T},\quad\text{for }r=1,\dots,R,\ \ell=1,\dots,J.$ (7)

Note that in the above definition, $m$ is called the coarsest scale level, which is the smallest $m$ satisfying $2^{-m}\lambda_{N}\leq\pi$. Denote by $\mathcal{W}=[\mathcal{W}_{0,J};\mathcal{W}_{1,0};\dots;\mathcal{W}_{R,0};\mathcal{W}_{1,1};\dots;\mathcal{W}_{R,J}]$ the stacked matrix. It can be proved that $\mathcal{W}^{T}\mathcal{W}=\mathbf{I}$, thus providing a signal decomposition and reconstruction process based on $\mathcal{W}$. We call this the graph framelet transformation.
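As a concrete illustration of Eqs. (5)-(7), the following sketch builds the transformation matrices from the exact eigendecomposition using a simple Haar-type pair of scaling functions, $g_0(\xi)=\cos(\xi/2)$ and $g_1(\xi)=\sin(\xi/2)$, which satisfy the identity condition (4); the filter choice and function names are illustrative and not necessarily the filters used in our experiments.

```python
import numpy as np

# Illustrative Haar-type Quasi-Framelet pair: g0^2 + g1^2 = 1 on [0, pi]
g = [lambda x: np.cos(x / 2.0),   # g0: descends from 1 to 0
     lambda x: np.sin(x / 2.0)]   # gR: ascends from 0 to 1

def framelet_matrices(L_norm, J=1):
    """Build the transformation matrices of Eqs. (5)-(7) by eigendecomposition."""
    lam, U = np.linalg.eigh(L_norm)
    lam = np.clip(lam, 0.0, None)
    # coarsest scale: smallest m with 2^{-m} * lam_max <= pi
    m = int(np.ceil(np.log2(max(lam.max(), 1e-12) / np.pi)))
    spec = lambda f, s: U @ np.diag(f(lam / 2.0 ** s)) @ U.T
    R = len(g) - 1
    Ws = []
    # low-pass W_{0,J} = g0(Lam/2^{m+J}) ... g0(Lam/2^m)
    W0 = np.eye(L_norm.shape[0])
    for l in range(J + 1):
        W0 = spec(g[0], m + J - l) @ W0
    Ws.append(W0)
    # high-pass bands W_{r,0} and W_{r,l}
    for r in range(1, R + 1):
        Ws.append(spec(g[r], m))
        for l in range(1, J + 1):
            Wrl = spec(g[r], m + l)
            for k in range(l - 1, -1, -1):
                Wrl = Wrl @ spec(g[0], m + k)
            Ws.append(Wrl)
    return Ws

# Tightness check: sum_k W_k.T @ W_k should equal the identity matrix.
```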

To alleviate the computational cost of the eigendecomposition of the graph Laplacian, the framelet transformation matrices can be approximated by Chebyshev polynomials. Empirically, Chebyshev polynomials $\mathcal{T}^{n}_{r}(\xi)$ of a fixed degree $n=3$ are sufficient to approximate $g_{r}(\xi)$. In the sequel, we simplify the notation by writing $\mathcal{T}_{r}(\xi)$ instead of $\mathcal{T}^{n}_{r}(\xi)$. The Quasi-Framelet transformation matrices defined in Eqs. (5)-(7) can then be approximated by

$\mathcal{W}_{0,J}\approx\mathcal{T}_{0}(\frac{1}{2^{m+J}}\widetilde{\mathbf{L}})\cdots\mathcal{T}_{0}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),$ (8)
$\mathcal{W}_{r,0}\approx\mathcal{T}_{r}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),\quad\text{for }r=1,\dots,R,$ (9)
$\mathcal{W}_{r,\ell}\approx\mathcal{T}_{r}(\frac{1}{2^{m+\ell}}\widetilde{\mathbf{L}})\mathcal{T}_{0}(\frac{1}{2^{m+\ell-1}}\widetilde{\mathbf{L}})\cdots\mathcal{T}_{0}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),\quad\text{for }r=1,\dots,R,\ \ell=1,\dots,J.$ (10)

Remark 1: The approximated transformation matrices defined in Eqs. (8)-(10) depend only on the graph Laplacian. For directed graphs, we directly use the Laplacian normalized by the out-degrees in our experiments; we observe that this strategy leads to improved performance in general.
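To make Eqs. (8)-(10) concrete, the sketch below approximates a single spectral filter $g_r(2^{-s}\widetilde{\mathbf{L}})$ with a degree-3 Chebyshev expansion evaluated through the standard three-term recurrence, so no eigendecomposition is needed; the coefficient-fitting routine, the assumed spectral bound $\lambda_{\max}\leq 2$, and the choice $m=0$ in the usage comment are illustrative.

```python
import numpy as np

def cheb_coeffs(fn, degree, lam_max=2.0, n_pts=64):
    """Chebyshev expansion coefficients of fn on [0, lam_max]."""
    theta = np.pi * (np.arange(n_pts) + 0.5) / n_pts
    x = lam_max / 2.0 * (np.cos(theta) + 1.0)          # nodes mapped to [0, lam_max]
    return np.array([(2.0 / n_pts) * np.sum(fn(x) * np.cos(k * theta))
                     for k in range(degree + 1)])

def cheb_matrix_fn(L_norm, fn, degree=3, lam_max=2.0):
    """Approximate fn(L_norm) as a Chebyshev matrix polynomial."""
    c = cheb_coeffs(fn, degree, lam_max)
    N = L_norm.shape[0]
    L_hat = 2.0 / lam_max * L_norm - np.eye(N)          # spectrum mapped to [-1, 1]
    T_prev, T_curr = np.eye(N), L_hat
    out = 0.5 * c[0] * T_prev + c[1] * T_curr
    for k in range(2, degree + 1):
        T_prev, T_curr = T_curr, 2.0 * L_hat @ T_curr - T_prev
        out = out + c[k] * T_curr
    return out

# Example (assuming m = 0 and g1(x) = sin(x/2) as above):
# W_10 = cheb_matrix_fn(L_norm, lambda lam: np.sin(lam / 2.0), degree=3)
```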

For a graph signal $\mathbf{X}$, the framelet (graph) convolution, analogous to the spectral graph convolution, can be defined as

$\theta\star\mathbf{X}=\mathcal{W}^{T}(\textrm{diag}(\theta))(\mathcal{W}\mathbf{X}),$ (11)

where $\theta$ is the learnable network filter. We also call $\mathcal{W}\mathbf{X}$ the framelet coefficients of $\mathbf{X}$ in the Fourier domain. The signal is then filtered in the spectral domain according to the learnable filter $\textrm{diag}(\theta)$.
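A minimal sketch of the framelet convolution in Eq. (11) is given below; it assumes a list of transformation matrices such as the one produced by the framelet_matrices sketch above, and it uses per-band (scalar or per-node) filter coefficients as an illustrative simplification of the learnable filter $\theta$.

```python
import numpy as np

def framelet_conv(Ws, theta, X):
    """Framelet convolution of Eq. (11): theta * X = W^T diag(theta) (W X).

    Ws:    list of (N, N) framelet transformation matrices (the stacked W).
    theta: list of per-band filters; theta[k] is a scalar or a length-N vector
           applied to the k-th block of framelet coefficients.
    X:     (N, c) graph signal.
    """
    out = np.zeros_like(X, dtype=float)
    for Wk, th in zip(Ws, theta):
        coeff = Wk @ X                      # framelet coefficients in this band
        out += Wk.T @ (np.asarray(th).reshape(-1, 1) * coeff
                       if np.ndim(th) else th * coeff)
    return out
```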

2.5 Regularized Graph Neural Network

In reality, graph data normally have large, noisy, and complex structures (Leskovec and Faloutsos, 2006; He and Kempe, 2015). This brings the challenge of choosing a suitable neural network to fit the data. It is well known that a common computational issue for most classic GNNs is over-smoothing, which can be quantified by the Dirichlet energy converging to zero as the number of layers increases. This observation leads to an investigation into so-called implicit layers in regularized graph neural networks.

The first paper to treat the GNN layer as a regularized signal smoothing process is Zhu et al (2021), in which classic GNN layers are interpreted as the solution of a regularized optimization problem, with certain approximation strategies to avoid the matrix inversion in the closed-form solution. The regularized layer can also be linked to the implicit layer (Zhang et al, 2003); more specific examples are given by Zhu et al (2021). In general, given the node features, the output of the GNN layer can be written as the solution of the following optimization problem

$\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\left\{\mu\|\mathbf{X}-\mathbf{F}\|^{2}_{F}+\frac{1}{2}\text{tr}(\mathbf{F}^{T}\widetilde{\mathbf{L}}\mathbf{F})\right\},$ (12)

where $\mathbf{F}$ is the graph signal. Eq. (12) defines the layer output as the solution of the (layer) optimization problem with the regularization term $\text{tr}(\mathbf{F}^{T}\widetilde{\mathbf{L}}\mathbf{F})$, which enforces smoothness of the graph signal. The closed-form solution is $\mathbf{F}=(\mathbf{I}+\frac{1}{2\mu}\widetilde{\mathbf{L}})^{-1}\mathbf{X}$, which is computationally inefficient because of the matrix inversion. However, we can rearrange it as $(\mathbf{I}+\frac{1}{2\mu}\widetilde{\mathbf{L}})\mathbf{F}=\mathbf{X}$ and interpret it as a fixed-point solution of $\mathbf{F}^{(t+1)}=-\frac{1}{2\mu}\widetilde{\mathbf{L}}\mathbf{F}^{(t)}+\mathbf{X}$. The latter is then the implicit layer in the GNN at layer $t$. Such an iteration is motivated by recurrent GNNs, as shown in several recent works (Gu et al, 2020; Liu et al, 2021a; Park et al, 2021). Different from explicit GNNs, the output features $\mathbf{F}$ of a general implicit GNN are directly modeled as the solution of a well-defined implicit function, e.g., the first-order condition of an optimization problem.
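The fixed-point view can be written as a short routine; the sketch below iterates $\mathbf{F}^{(t+1)}=-\frac{1}{2\mu}\widetilde{\mathbf{L}}\mathbf{F}^{(t)}+\mathbf{X}$ for Eq. (12), with the stopping rule and parameter names chosen only for illustration.

```python
import numpy as np

def implicit_smoothing_layer(L_norm, X, mu=1.0, n_iter=50, tol=1e-6):
    """Fixed-point iteration F <- -1/(2 mu) * L~ F + X for Eq. (12).

    Converges to the closed-form solution (I + L~/(2 mu))^{-1} X whenever the
    spectral radius of L~/(2 mu) is below 1 (e.g., mu > 1 for a normalized
    Laplacian with eigenvalues in [0, 2]).
    """
    F = X.astype(float).copy()
    for _ in range(n_iter):
        F_new = -L_norm @ F / (2.0 * mu) + X
        if np.linalg.norm(F_new - F) <= tol * max(np.linalg.norm(F), 1.0):
            return F_new
        F = F_new
    return F
```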

Denote the rows of $\mathbf{F}$ by $\mathbf{f}_{i}$ ($i=1,2,\dots,N$), treated as column vectors. It is well known that the regularization term in (12) equals

$\frac{1}{2}\text{tr}(\mathbf{F}^{T}\widetilde{\mathbf{L}}\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}\right\|^{2},$

which is the so-called graph Dirichlet energy (Zhou and Schölkopf, 2005). As proposed in Zhou and Schölkopf (2005), one can replace the graph Laplacian in the above equation with the pre-defined graph p-Laplacian; the p-Laplacian based energy, denoted $\mathcal{S}_{p}(\mathbf{F})$, is then defined as (for any $p\geq 1$):

$\mathcal{S}_{p}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}\right\|^{p},$ (13)

where we adopt the definition of the element-wise p-norm as in Fu et al (2022). Finally, the regularization problem becomes

$\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\left\{\mu\|\mathbf{X}-\mathbf{F}\|^{2}_{F}+\mathcal{S}_{p}(\mathbf{F})\right\}.$ (14)

Given the strong generalizability of the regularization form in Eq. (14), it is natural to deploy it in multi-scale graph neural networks (i.e., graph framelets), to explore whether the benefits of a p-Laplacian based regularizer remain and how varying $p$ in the regularizer interacts with the feature propagation through each individual filtered domain. In the next section, we propose our models accordingly and provide a detailed discussion and analysis.

3 The Proposed Models

In this section, we present our proposed models: the p-Laplacian Undecimated Framelet GNN (pL-UFG) and the p-Laplacian Fourier Undecimated Framelet GNN (pL-fUFG). In addition, we introduce a more general regularization framework and describe the algorithms for our proposed models.

3.1 p-Laplacian Undecimated Framelet GNN (pL-UFG)

Instead of simply taking the convolution result in Eq. (11) as the framelet layer output, we apply p-Laplacian regularization to the framelet reconstructed signal by imposing the following optimization to define the new layer,

$\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\ \mathcal{S}_{p}(\mathbf{F})+\mu\|\mathbf{F}-\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}\|^{2}_{F},$ (15)

where $\mu\in(0,\infty)$. Recall that the first term $\mathcal{S}_{p}(\mathbf{F})$ is the p-Laplacian based energy defined in Eq. (13); it measures the total variation of the signal over the graph via the p-Laplacian. The second term dictates that the optimal $\mathbf{F}$ should not deviate excessively from the framelet reconstruction of the input signal $\mathbf{X}$. Each component of the network filter $\theta$ in the frequency domain is applied to the framelet coefficients $\mathcal{W}\mathbf{X}$. We call the model in Eq. (15) pL-UFG2.

One possible variant of model (15) is to apply the regularization individually to the reconstruction at each scale level, i.e., for all $r,\ell$, define

$\mathbf{F}_{r,\ell}=\operatorname*{arg\,min}_{\mathbf{F}_{r,\ell}}\ \mathcal{S}_{p}(\mathbf{F}_{r,\ell})+\mu\|\mathbf{F}_{r,\ell}-\mathcal{W}^{T}_{r,\ell}\textrm{diag}(\theta_{r,\ell})\mathcal{W}_{r,\ell}\mathbf{X}\|^{2}_{F},$
$\mathbf{F}=\mathbf{F}_{0,J}+\sum^{R}_{r=1}\sum^{J}_{\ell=0}\mathbf{F}_{r,\ell}.$ (16)

We name the model with the propagation rule in the above equation pL-UFG1. In our experiments, we observe that pL-UFG1 performs better than pL-UFG2.

3.2 p-Laplacian Fourier Undecimated Framelet GNN (pL-fUFG)

In pL-fUFG, we adopt the strategy of regularizing the framelet coefficients. Informed by the earlier development, we consider the following optimization problem for each framelet transformation in the Fourier domain. For each framelet transformation $\mathcal{W}_{r,\ell}$ defined in Eqs. (9)-(10), define

$\mathbf{F}_{r,\ell}=\operatorname*{arg\,min}_{\mathbf{F}_{r,\ell}}\ \mathcal{S}_{p}(\mathbf{F}_{r,\ell})+\mu\|\mathbf{F}_{r,\ell}-\textrm{diag}(\theta_{r,\ell})\mathcal{W}_{r,\ell}\mathbf{X}\|^{2}_{F}.$ (17)

The final layer output is then defined by the reconstruction

$\mathbf{F}=\mathcal{W}^{T}_{0,J}\mathbf{F}_{0,J}+\sum^{R}_{r=1}\sum^{J}_{\ell=0}\mathcal{W}^{T}_{r,\ell}\mathbf{F}_{r,\ell}.$ (18)

Alternatively, we can simply aggregate all the regularized and filtered framelet coefficients as

$\mathbf{F}=\mathbf{F}_{0,J}+\sum^{R}_{r=1}\sum^{J}_{\ell=0}\mathbf{F}_{r,\ell}.$ (19)

With the forms of both pL-UFG and pL-fUFG in place, in the next section we present an even more general regularization framework that incorporates the p-Laplacian with graph framelets.

3.3 More General Regularization

Figure 1: The workflow of the p-Laplacian regularized framelet. The input graph data is first filtered and reconstructed by the framelet model, which contains one low-pass and two high-pass filters (i.e., $J=2$). Then, the result (denoted $\mathbf{Y}$) is further regularized by a sequence of graph convolutions and diagonal rescalings induced by the p-Laplacian, which is generated from the graph gradient information and serves as an implicit layer of the model. By adjusting the value of $p$, the node features produced by this implicit layer can be smoothed or sharpened accordingly, allowing the model to adapt to both homophilic and heterophilic graphs. Lastly, the layer output is either forwarded to the task objective function or to the next framelet and p-Laplacian layers before the final prediction task.

For convenience, we write the graph gradient for the multi-channel signal $\mathbf{F}$ as

$\nabla_{W}\mathbf{F}([i,j])=\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i},$

or simply $\nabla\mathbf{F}$ if no confusion arises. For an undirected graph we have $\nabla_{W}\mathbf{F}([i,j])=-\nabla_{W}\mathbf{F}([j,i])$. With this notation, we can re-write the p-Laplacian regularizer in (13) (the element-wise p-norm) as follows:

$\mathcal{S}_{p}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p}=\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\left[\left(\sum_{v_{j}\sim v_{i}}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p}\right)^{\frac{1}{p}}\right]^{p}=\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}^{p},$ (20)

where $v_{j}\sim v_{i}$ denotes a node $v_{j}$ connected to node $v_{i}$, $\nabla_{W}\mathbf{F}(v_{i})=\left(\nabla_{W}\mathbf{F}([i,j])\right)_{v_{j}:(v_{i},v_{j})\in\mathcal{E}}$ is the node gradient vector at each node $v_{i}$, and $\|\cdot\|_{p}$ is the vector p-norm. In fact, $\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}$ measures the variation of $\mathbf{F}$ in the neighbourhood of each node $v_{i}$. Next, we generalize the regularizer by considering any positive convex function $\phi$ as

$\mathcal{S}^{\phi}_{p}(\mathbf{F})=\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\phi(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}).$ (21)

Clearly, when $\phi(\xi)=\xi^{p}$, we recover the p-Laplacian regularizer. In the image processing field, several penalty functions have been proposed in the literature. For example, $\phi(\xi)=\xi^{2}$ is known in the context of Tikhonov regularization (Li et al, 2022; Belkin et al, 2004; Assis et al, 2021). When $\phi(\xi)=\xi$ (i.e., $p=1$), it is the classic total variation regularization. When $\phi(\xi)=\sqrt{\xi^{2}+\epsilon^{2}}-\epsilon$, it is referred to as regularized total variation; an example can be found in Cheng and Wang (2020). The case $\phi(\xi)=r^{2}\log(1+\xi^{2}/r^{2})$ corresponds to nonlinear diffusion (Oka and Yamada, 2023).
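For later reference, the sketch below collects these candidate penalties together with their derivatives $\phi'(\xi)$, which is the quantity consumed by the iterative algorithm of Theorem 3.1; the dictionary layout and default parameters are illustrative.

```python
import numpy as np

# Candidate penalty functions phi(xi) and their derivatives phi'(xi).
PENALTIES = {
    # p-Laplacian regularizer: phi(xi) = xi^p
    "p_power":   lambda p:   (lambda x: x ** p,
                              lambda x: p * x ** (p - 1)),
    # Tikhonov: phi(xi) = xi^2
    "tikhonov":  lambda _:   (lambda x: x ** 2,
                              lambda x: 2.0 * x),
    # regularized total variation: phi(xi) = sqrt(xi^2 + eps^2) - eps
    "reg_tv":    lambda eps: (lambda x: np.sqrt(x ** 2 + eps ** 2) - eps,
                              lambda x: x / np.sqrt(x ** 2 + eps ** 2)),
    # nonlinear diffusion: phi(xi) = r^2 log(1 + xi^2 / r^2)
    "diffusion": lambda r:   (lambda x: r ** 2 * np.log1p(x ** 2 / r ** 2),
                              lambda x: 2.0 * x / (1.0 + x ** 2 / r ** 2)),
}

phi, dphi = PENALTIES["p_power"](1.5)   # e.g., p = 1.5
```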

In pL-UFG and pL-fUFG, we use $\mathbf{Y}$ to denote $\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}$ and $\textrm{diag}(\theta_{r,\ell})\mathcal{W}_{r,\ell}\mathbf{X}$, respectively. We then propose the generalized p-Laplacian regularization model as

$\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\ \mathcal{S}^{\phi}_{p}(\mathbf{F})+\mu\|\mathbf{F}-\mathbf{Y}\|^{2}_{F}.$ (22)

3.4 The Algorithm

We derive an iterative algorithm for solving the generalized p-Laplacian regularization problem (22), presented in the following theorem.

Theorem 3.1.

For a given positive convex function $\phi(\xi)$, define

$M_{i,j}=\frac{w_{i,j}}{2}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p-2}\cdot\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{j})\|^{p-1}_{p}}\right],$
$\alpha_{ii}=1\Big/\left(\sum_{v_{j}\sim v_{i}}\frac{M_{i,j}}{d_{i,i}}+2\mu\right),\qquad\beta_{ii}=2\mu\alpha_{ii},$

and denote the matrices $\mathbf{M}=[M_{i,j}]$, $\boldsymbol{\alpha}=\text{diag}(\alpha_{11},\dots,\alpha_{NN})$ and $\boldsymbol{\beta}=\text{diag}(\beta_{11},\dots,\beta_{NN})$. Then the solution to problem (22) can be obtained by the following message passing process

$\mathbf{F}^{(k+1)}=\boldsymbol{\alpha}^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}\mathbf{F}^{(k)}+\boldsymbol{\beta}^{(k)}\mathbf{Y},$ (23)

with an initial value, e.g., $\mathbf{F}^{(0)}=\mathbf{0}$.

Here $\boldsymbol{\alpha}^{(k)}$, $\boldsymbol{\beta}^{(k)}$ and $\mathbf{M}^{(k)}$ are calculated from the current features $\mathbf{F}^{(k)}$. When $\phi(\xi)=\xi^{p}$, Eq. (23) yields the algorithm for solving pL-UFG and pL-fUFG.

Proof: Define $\mathcal{L}^{\phi}_{p}(\mathbf{F})$ as the objective function in Eq. (22) and consider a node feature $\mathbf{f}_{i}$ in $\mathbf{F}$. Then we have

$\frac{\partial\mathcal{L}^{\phi}_{p}(\mathbf{F})}{\partial\mathbf{f}_{i}}=2\mu(\mathbf{f}_{i}-\mathbf{y}_{i})+\frac{1}{2}\sum_{v_{k}\in\mathcal{V}}\frac{\partial}{\partial\mathbf{f}_{i}}\phi(\|\nabla_{W}\mathbf{F}(v_{k})\|_{p})$

$=2\mu(\mathbf{f}_{i}-\mathbf{y}_{i})+\frac{1}{2}\frac{\partial}{\partial\mathbf{f}_{i}}\phi(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})+\frac{1}{2}\sum_{v_{j}\sim v_{i}}\frac{\partial}{\partial\mathbf{f}_{i}}\phi(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})$

$=2\mu(\mathbf{f}_{i}-\mathbf{y}_{i})+\frac{1}{2p}\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})\,\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}^{1-p}\,\frac{\partial}{\partial\mathbf{f}_{i}}\Big(\sum_{v_{j}\sim v_{i}}\|\nabla_{W}\mathbf{F}([i,j])\|^{p}\Big)+\frac{1}{2p}\sum_{v_{j}\sim v_{i}}\frac{\partial}{\partial\mathbf{f}_{i}}\|\nabla_{W}\mathbf{F}([j,i])\|^{p}\cdot\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})\,\|\nabla_{W}\mathbf{F}(v_{j})\|_{p}^{1-p}$

$=2\mu(\mathbf{f}_{i}-\mathbf{y}_{i})+\frac{1}{2p}\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})\,\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}^{1-p}\Big(\sum_{v_{j}\sim v_{i}}p\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}\sqrt{\tfrac{w_{i,j}}{d_{i,i}}}\,(-\nabla_{W}\mathbf{F}([i,j]))\Big)+\frac{1}{2}\sum_{v_{j}\sim v_{i}}\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})\,\|\nabla_{W}\mathbf{F}(v_{j})\|_{p}^{1-p}\,\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}\sqrt{\tfrac{w_{i,j}}{d_{i,i}}}\,\nabla_{W}\mathbf{F}([j,i])$

$=2\mu(\mathbf{f}_{i}-\mathbf{y}_{i})+\sum_{v_{j}\sim v_{i}}\frac{1}{2}\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}\sqrt{\tfrac{w_{i,j}}{d_{i,i}}}\,\nabla_{W}\mathbf{F}([j,i])\cdot\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{j})\|^{p-1}_{p}}\right].$

Setting $\frac{\partial\mathcal{L}^{\phi}_{p}(\mathbf{F})}{\partial\mathbf{f}_{i}}=0$ gives the following first-order condition:

$\left(\sum_{v_{j}\sim v_{i}}\frac{1}{2}\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{j})\|^{p-1}_{p}}\right]\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}\frac{w_{i,j}}{d_{i,i}}+2\mu\right)\mathbf{f}_{i}=\sum_{v_{j}\sim v_{i}}\frac{1}{2}\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}(v_{j})\|^{p-1}_{p}}\right]\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}\frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}}\mathbf{f}_{j}+2\mu\mathbf{y}_{i}.$

This equation defines the message passing on each node $v_{i}$ from its neighbouring nodes $v_{j}$. With the definitions of $M_{i,j}$, $\alpha_{ii}$ and $\beta_{ii}$, it can be turned into the iterative formula in Eq. (23). This completes the proof.
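A compact sketch of the message passing in Eq. (23) for the p-Laplacian case $\phi(\xi)=\xi^{p}$ (so that $\phi'(\xi)/\xi^{p-1}=p$) is given below; the initialization at $\mathbf{Y}$, the small $\varepsilon$ guard, and the dense NumPy implementation are illustrative choices and not the exact implementation used in our experiments.

```python
import numpy as np

def p_laplacian_framelet_iterations(W, Y, p=1.5, mu=0.5, K=10, eps=1e-8):
    """Iterate Eq. (23): F <- alpha D^{-1/2} M D^{-1/2} F + beta Y,
    for phi(xi) = xi^p, where phi'(xi) / xi^{p-1} = p."""
    d = np.maximum(W.sum(axis=1), eps)
    S = np.sqrt(W)
    # Initialised at the framelet output Y for numerical stability; the theorem
    # allows any initial value (e.g., F^(0) = 0).
    F = Y.astype(float).copy()
    for _ in range(K):
        # edge-wise gradient and its norms (Definition 2.2)
        grad = (S / np.sqrt(d)[None, :])[:, :, None] * F[None, :, :] \
             - (S / np.sqrt(d)[:, None])[:, :, None] * F[:, None, :]
        edge_norm = np.linalg.norm(grad, axis=2) + eps
        # M_ij = (w_ij / 2) * ||grad_ij||^{p-2} * 2p  when phi(xi) = xi^p
        M = p * W * edge_norm ** (p - 2)
        alpha = 1.0 / (M.sum(axis=1) / d + 2.0 * mu)   # diagonal of alpha
        beta = 2.0 * mu * alpha                        # diagonal of beta
        Dm12 = 1.0 / np.sqrt(d)
        F = alpha[:, None] * (Dm12[:, None] * (M @ (Dm12[:, None] * F))) \
            + beta[:, None] * Y
    return F
```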

Remark 2.

When $p\leq 1$, the objective function is not differentiable in some extreme cases, for example when neighbouring node signals are identical and the nodes have the same degrees. In these rare cases, the first-order condition cannot be applied in the above proof. In practice, however, we suggest the following alternative iterative algorithm to solve the optimization problem. In fact, we can split the terms in $\mathcal{S}_{p}(\mathbf{F})$ as

$\mathcal{S}_{p}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p-2}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{2}.$ (24)

At iteration $k$, we take $w^{\text{new}}_{i,j}=w_{i,j}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p-2}$ as the new edge weights; the next iterate is then defined as the minimizer of the Dirichlet energy with the new weights, i.e.,

$\mathbf{F}^{(k+1)}=\operatorname*{arg\,min}\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\nabla_{W^{\text{new}}}\mathbf{F}([i,j])\right\|^{2}.$

Thus one step of the classic GCN can be applied at each iteration to solve the p-Laplacian regularized problem (14).
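The reweighting scheme of this remark can be sketched as follows: at each iteration the edge weights are rescaled by $\|\nabla_{W}\mathbf{F}([i,j])\|^{p-2}$ and one GCN-style smoothing step is taken with the reweighted normalized adjacency; the number of iterations and the particular smoothing operator are illustrative assumptions.

```python
import numpy as np

def reweighted_gcn_steps(W, X, p=1.0, K=5, eps=1e-8):
    """Remark 2: reweight edges by ||grad F([i,j])||^{p-2} and apply one
    smoothing step with D_new^{-1/2} W_new D_new^{-1/2}."""
    F = X.astype(float).copy()
    for _ in range(K):
        d = np.maximum(W.sum(axis=1), eps)
        S = np.sqrt(W)
        grad = (S / np.sqrt(d)[None, :])[:, :, None] * F[None, :, :] \
             - (S / np.sqrt(d)[:, None])[:, :, None] * F[:, None, :]
        edge_norm = np.linalg.norm(grad, axis=2) + eps
        W_new = W * edge_norm ** (p - 2)               # reweighted edges
        d_new = np.maximum(W_new.sum(axis=1), eps)
        A_hat = W_new / np.sqrt(np.outer(d_new, d_new))
        F = A_hat @ F                                  # one GCN-style step
    return F
```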

3.5 Interaction between p-Laplacian and Framelets

In this section, we present some theoretical support on how the p-Laplacian regularizer interacts with the framelets in the model. Specifically, we show that the proposed algorithm in Eq. (23) applies a sequence of (p-Laplacian based) spectral graph convolutions and diagonal scalings of the node features over the framelet-filtered graph signals. This is indicated by the following analysis.

First, consider the iteration in Eq. (23) with initial value $\mathbf{F}^{(0)}=\mathbf{Y}=\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}$ or $\textrm{diag}(\theta_{r,\ell})\mathcal{W}_{r,\ell}\mathbf{X}$, without loss of generality (in our practical algorithm we choose $\mathbf{F}^{(0)}=\mathbf{0}$, which gives $\mathbf{F}^{(1)}=\boldsymbol{\beta}^{(0)}\mathbf{Y}$ with a diagonal matrix $\boldsymbol{\beta}^{(0)}$). We have

$\mathbf{F}^{(K)}=\boldsymbol{\alpha}^{(K-1)}\mathbf{D}^{-1/2}\mathbf{M}^{(K-1)}\mathbf{D}^{-1/2}\mathbf{F}^{(K-1)}+\boldsymbol{\beta}^{(K-1)}\mathbf{Y}$
$=\boldsymbol{\alpha}^{(K-1)}\widetilde{\mathbf{M}}^{(K-1)}\mathbf{F}^{(K-1)}+\boldsymbol{\beta}^{(K-1)}\mathbf{Y}$
$=\boldsymbol{\alpha}^{(K-1)}\widetilde{\mathbf{M}}^{(K-1)}\left(\boldsymbol{\alpha}^{(K-2)}\widetilde{\mathbf{M}}^{(K-2)}\mathbf{F}^{(K-2)}+\boldsymbol{\beta}^{(K-2)}\mathbf{Y}\right)+\boldsymbol{\beta}^{(K-1)}\mathbf{Y}$
$=\boldsymbol{\alpha}^{(K-1)}\widetilde{\mathbf{M}}^{(K-1)}\boldsymbol{\alpha}^{(K-2)}\widetilde{\mathbf{M}}^{(K-2)}\mathbf{F}^{(K-2)}+\boldsymbol{\alpha}^{(K-1)}\widetilde{\mathbf{M}}^{(K-1)}\boldsymbol{\beta}^{(K-2)}\mathbf{Y}+\boldsymbol{\beta}^{(K-1)}\mathbf{Y}$
$=\left(\prod_{k=0}^{K-1}\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}\right)\mathbf{Y}+\boldsymbol{\beta}^{(K-1)}\mathbf{Y}+\sum_{k=0}^{K-1}\left(\prod_{l=K-k}^{K-1}\boldsymbol{\alpha}^{(l)}\widetilde{\mathbf{M}}^{(l)}\right)\boldsymbol{\beta}^{(K-k-1)}\mathbf{Y}.$

The result $\mathbf{F}^{(K)}$ depends on the key operations $\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}$ for $k=0,1,\dots,K-1$, where

$\widetilde{M}^{(k)}_{i,j}=\frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}}\left\|\nabla_{W}\mathbf{F}^{(k)}([i,j])\right\|^{p-2}\cdot\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|^{p-1}_{p}}\right]=\frac{w_{i,j}\cdot w^{(k)}_{i,j}(p)}{\sqrt{d_{i,i}d_{j,j}}}\left\|\nabla_{W}\mathbf{F}^{(k)}([i,j])\right\|^{p-2},$ (25)

$\alpha_{i,i}^{(k)}=1\Big/\left(\sum_{v_{j}\sim v_{i}}\widetilde{M}^{(k)}_{i,j}+2\mu\right),$ (26)

where $\widetilde{M}_{i,j}$ has absorbed $d_{i,i}$ and $d_{j,j}$ into $M_{i,j}$ defined in Theorem 3.1, i.e., $\widetilde{M}_{i,j}=M_{i,j}/\sqrt{d_{i,i}d_{j,j}}$, and we denote

$w^{(k)}_{i,j}(p)=\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|^{p-1}_{p}}\right].$

For the following analysis, define the matrices $\widetilde{\mathbf{M}}^{(k)}=[\widetilde{M}^{(k)}_{i,j}]$, $\boldsymbol{\alpha}^{(k)}=\text{diag}(\alpha_{ii}^{(k)})$ and $\mathbf{W}^{(k)}_{p}=[w^{(k)}_{i,j}(p)]=\mathbf{w}^{(k)}_{p}\oplus(\mathbf{w}^{(k)}_{p})^{T}$ (the Kronecker sum) with the column vector

$\mathbf{w}^{(k)}_{p}=\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|^{p-1}_{p}}\right]^{N}_{i=1}.$

Now, recall the definition of the classic p-Laplacian:

$\Delta_{p}(\mathbf{F}):=-\frac{1}{2}\text{div}(\|\nabla\mathbf{F}\|^{p-2}\nabla\mathbf{F}),\quad\text{for}\,\,\,p\geq 1.$ (27)

Our purpose is to show that the generalized algorithm derived above can be implemented as a linear combination of classic p-Laplacian filtering operations. First, based on Fu et al (2022), the operator in Eq. (27) used in the original p-Laplacian message passing algorithm can be written in matrix form as

$\Delta^{(k)}_{p}(\mathbf{F}^{(k)})=\left((\boldsymbol{\alpha}^{(k)}_{0})^{-1}-2\mu\mathbf{I}_{N}\right)\mathbf{F}^{(k)}-\mathbf{G}^{(k)}\mathbf{F}^{(k)},$ (28)

where the matrix $\mathbf{G}^{(k)}$ has elements

$G^{(k)}_{i,j}=\frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}}\left\|\nabla_{W}\mathbf{F}^{(k)}([i,j])\right\|^{p-2},$

and the diagonal matrix $\boldsymbol{\alpha}^{(k)}_{0}$ has diagonal elements

$(\alpha^{(k)}_{0})_{ii}=1\Big/\left(\sum_{v_{j}\sim v_{i}}G^{(k)}_{i,j}+2\mu/p\right).$

Eq. (28) shows that the operation $\Delta^{(k)}_{p}(\mathbf{F}^{(k)})$ can be represented as the product of a matrix (still denoted $\Delta^{(k)}_{p}$) and the signal matrix $\mathbf{F}^{(k)}$.

Noting that $\mathbf{G}^{(k)}=\widetilde{\mathbf{M}}^{(k)}\oslash\mathbf{W}^{(k)}_{p}$, multiplying the diagonal matrix $\boldsymbol{\alpha}^{(k)}$ on both sides of Eq. (28) and factoring out $\mathbf{F}^{(k)}$ give

$\boldsymbol{\alpha}^{(k)}\Delta^{(k)}_{p}=\left(\boldsymbol{\alpha}^{(k)}(\boldsymbol{\alpha}^{(k)}_{0})^{-1}-2\mu\boldsymbol{\alpha}^{(k)}\right)-\boldsymbol{\alpha}^{(k)}(\widetilde{\mathbf{M}}^{(k)}\oslash\mathbf{W}^{(k)}_{p}).$

As $\boldsymbol{\alpha}^{(k)}$ is diagonal, we can re-write the above equality as

$\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}=\left(\left(\boldsymbol{\alpha}^{(k)}(\boldsymbol{\alpha}^{(k)}_{0})^{-1}-2\mu\boldsymbol{\alpha}^{(k)}\right)-\boldsymbol{\alpha}^{(k)}\Delta^{(k)}_{p}\right)\odot\mathbf{W}^{(k)}_{p}.$

Given that $\mathbf{W}^{(k)}_{p}$ is the Kronecker sum of the vector $\mathbf{w}^{(k)}_{p}$ and its transpose, if we also use $\mathbf{w}^{(k)}_{p}$ to denote the corresponding diagonal matrix, then we have

$\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}=\left(\left(\boldsymbol{\alpha}^{(k)}/\boldsymbol{\alpha}^{(k)}_{0}-2\mu\boldsymbol{\alpha}^{(k)}\right)-\boldsymbol{\alpha}^{(k)}\Delta^{(k)}_{p}\right)\mathbf{w}^{(k)}_{p}+\mathbf{w}^{(k)}_{p}\left(\left(\boldsymbol{\alpha}^{(k)}/\boldsymbol{\alpha}^{(k)}_{0}-2\mu\boldsymbol{\alpha}^{(k)}\right)-\boldsymbol{\alpha}^{(k)}\Delta^{(k)}_{p}\right).$ (29)

This demonstrates that the key terms $\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}$ in our generalized algorithm are linear in the p-Laplacian operator $\Delta^{(k)}_{p}$. As demonstrated in Fu et al (2022), the operator $\Delta^{(k)}_{p}$ approximately performs a spectral graph convolution. Hence we conclude that the generalized p-Laplacian iteration (23) indeed performs a sequence of graph spectral convolutions (see Lemma 1 below) and gradient-based diagonal transformations (i.e., node feature scalings) over the framelet-filtered graph signals.

Lemma 1.

The matrix $\Delta^{(k)}_{p}$ is SPD (see Bozorgnia et al (2019); Ly (2005)) and hence has its own eigendecomposition, offering a graph spectral convolution interpretation.

Remark 3.

Eq. (29) indicates that the p-Laplacian regularizer provides a graph spectral convolution on top of the framelet filtering, which conceptually produces a second layer of filtering and hence further restricts the solution space. See Fig. 1 for an illustration. Interestingly, the combination of p-Laplacian regularization and framelets offers a more adaptive smoothness that suits both homophilic and heterophilic data, as shown in our experiments.

Remark 4.

Regarding the asymptotic behavior of the expansion of $\mathbf{F}^{(K)}$ above, one can show roughly that the elements of $\boldsymbol{\alpha}^{(k)}\widetilde{\mathbf{M}}^{(k)}$ lie between 0 and 1, so the corresponding product converges to zero if $K$ is too large. Therefore a large $K$ annihilates the first term in the expansion but leaves the second term and a partial sum from the third. A larger value of $\mu$ appears to speed up this convergence further, thereby shortening the time needed to find the solution; however, choosing $\mu$ is a model selection problem for generalizability. Moreover, the inclusion of the source term $\boldsymbol{\beta}^{(K-1)}\mathbf{Y}$ guarantees a certain amount of variation of the node features from both the low and high frequency domains of the graph framelet, and it has been shown that such an inclusion can help the model escape the over-smoothing issue (Chien et al, 2021).

Remark 5.

Compared to GPRGNN (Chien et al, 2021), in which the model outcome is obtained by

$\hat{\mathbf{F}}=\sum_{k=0}^{K-1}\boldsymbol{\gamma}_{k}\mathbf{F}^{(k)},$

where the $\boldsymbol{\gamma}_{k}$ are learnable generalized PageRank coefficients, our proposed algorithm in Eq. (23), as shown in Section 3.5, applies a mixed operation (spectral convolution and diagonal scaling) to the framelet graph signal outputs and thus further directs the solution space of the framelet to a reasonably defined region. Furthermore, due to the use of Chebyshev polynomials in the framelet, the computational complexity of the first filtering stage is low, which is helpful for defining a provably correct space for the second operation. This shows the efficiency of incorporating framelets into p-Laplacian based regularization.

4 Discussions on the Proposed Model

In this section, we provide comprehensive discussions of our proposed models. We mainly focus on exploring the properties of the generalized p-Laplacian based framelet GNN (Eq. (22)) and the iterative algorithm in Eq. (23), although we also present some conclusions for pL-UFG and pL-fUFG. Specifically, in Section 4.1 we discuss the denoising power of our proposed model. Section 4.2 analyzes the model's computational complexity. Section 4.3 compares the p-Laplacian regularizer with other regularizers applied to GNNs for different learning tasks. Finally, Section 4.4 concludes the section by illustrating the limitations of the model and potential directions for future study.

4.1 Discussion on the Denoising Power

The denoising power of framelets was developed in Dong (2017) and has been empirically studied in subsequent works (Dong, 2017; Zhou et al, 2021a; Yang et al, 2022). Let $\mathbf{X}=\mathbf{F}+\tau$ be the input node feature matrix with noise $\tau$. In Dong (2017) and Zhou et al (2021a), a framelet-based regularizer was proposed for the optimization problem:

\min_{\mathbf{F}}\|\mathcal{D}\mathbf{F}\|_{1,\mathcal{G}}+f(\mathbf{F},\mathbf{X}), \qquad (30)

where f(·,·) is a fidelity function that takes different forms for different applications. For any graph signal F, the graph 𝕃_p-norm is defined as ‖F‖_{p,𝒢} := (∑_i |f_i|^p d_{i,i})^{1/p}, where d_{i,i} is the degree of the i-th node. 𝒟 is a linear transformation generated from a discrete transform, for example the graph Fourier transform or the framelet transform. Thus, the regularizer assigns a higher penalty to nodes with a larger number of neighbours. Specifically, taking 𝒟 to be the framelet transform, the first term in Eq. (30) can be written as:

\|\mathcal{D}\mathbf{F}\|_{1,\mathcal{G}}=\left(\sum_{(r,\ell)\in\mathcal{Z}}\sum_{i}|\mathbf{F}_{r,\ell}[i]|^{p}\, d_{i,i}\right)^{1/p}, \qquad (31)

where 𝒵 = {(r,ℓ) : r = 1,…,R, ℓ = 0,…,J} ∪ {(0,J)}. Replacing the p-Laplacian regularizer in Eq. (15) by the framelet regularizer defined in Eq. (31), we have:

\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\ \left(\sum_{(r,\ell)\in\mathcal{Z}}\sum_{i}|\mathbf{F}_{r,\ell}[i]|^{p}\, d_{i,i}\right)^{1/p}+\mu\|\mathbf{F}-\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}\|^{2}_{F}. \qquad (32)

Following Dong (2017), the framelet regularizer-based denoising problem in Eq. (30) is associated with a variational regularization problem of a form similar to Eq. (15), where the variational term S_p(F) regularizes the framelet objective function. Simply replacing the input of the first term in Eq. (32) by ∇F and omitting the node degree information, we have:

\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\sum_{(r,\ell)\in\mathcal{Z}}\|\nabla\mathbf{F}_{r,\ell}\|_{p}^{p}+\mu\|\mathbf{F}-\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}\|^{2}_{F}. \qquad (33)

Therefore our proposed model is naturally equipped with a denoising capability, and the corresponding (denoising) regularizing term aligns with the denoising regularizer developed by Zhou and Schölkopf (2005), up to the node degree information. However, Zhou and Schölkopf (2005) only handle p = 1 and p = 2, whereas our p-Laplacian framelet model covers all values of p ∈ ℝ_+. This allows more flexibility and effectiveness in the model.

Furthermore, the major difference between most wavelet frame models and classic variational models is the choice of the underlying transformation (i.e., 𝒟 applied to the graph signal) that maps the data to a transformed domain in which it can be sparsely approximated. Given that both the p-Laplacian regularizer (Eq. (15)) and the framelet regularizer (Eq. (31)) act on the graph signal F_𝒵 produced by the sparse and tight framelet transform (approximated by Chebyshev polynomials), this further illustrates the equivalence between the two denoising regularizers. Please refer to Dong (2017) for more details.
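As a minimal illustration of the regularizer in Eq. (33), the sketch below evaluates the objective on a toy graph; the edge list, the placeholder "framelet bands" and the reconstructed signal X_hat are illustrative assumptions rather than an actual framelet transform.

```python
import numpy as np

rng = np.random.default_rng(1)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # toy graph
F = rng.standard_normal((4, 2))            # candidate node features
F_bands = [0.7 * F, 0.3 * F]               # placeholders for the framelet bands F_{r,l}
X_hat = rng.standard_normal((4, 2))        # stands in for W^T diag(theta) W X

def graph_gradient(Fb, edges):
    """Edge-wise feature differences f_u - f_v (one row per edge)."""
    return np.stack([Fb[u] - Fb[v] for u, v in edges])

def objective(F, F_bands, X_hat, p=1.5, mu=1.0):
    # First term of Eq. (33): sum over bands of ||grad F_{r,l}||_p^p.
    reg = sum(np.sum(np.abs(graph_gradient(Fb, edges)) ** p) for Fb in F_bands)
    # Second term: Frobenius fidelity to the framelet-reconstructed signal.
    fit = mu * np.sum((F - X_hat) ** 2)
    return reg + fit

print(objective(F, F_bands, X_hat))
```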

4.2 Discussion on the Computational Complexity

In this section, we briefly discuss the computational complexity of our proposed model, specifically the generalized model defined in Eq. (22), whose results can be approximated by the algorithm presented in Eq. (23). The primary computational cost of our model stems from the generation of the framelet decomposition matrices (𝒲), or equivalently the framelet transformation applied to the node features. Generating 𝒲 exactly involves a high computational complexity, as traditional framelet methods typically employ an eigen-decomposition of the graph Laplacian for this purpose. Consequently, the framelet transform itself exhibits a time complexity of O(N²(nj+1)Kc) and a space complexity of O(N²(nj+1)c) for a graph with N nodes and c-dimensional features. Notably, the constants n, j, and K are independent of the graph data. Existing literature (Zheng et al, 2021b) has demonstrated that when the input graph contains fewer than 4×10⁴ nodes (with fixed feature dimension), the computational time for framelet convolution is comparable to that of a graph attention network with 8 attention heads (Veličković et al, 2018). As the number of nodes increases, the performance of graph attention networks degrades rapidly, while graph framelets maintain faster computation. Moreover, our use of Chebyshev polynomials to approximate 𝒲 significantly reduces the associated computational cost compared to traditional methods. We also acknowledge that the inclusion of the p-Laplacian based implicit layer introduces additional computational cost to the original framelet model, primarily arising from the computation of the norm of the graph gradient, |∇F|. Taking the Euclidean norm as an example, the computational cost of |∇F| scales as O(N×c), where N and c are the number of nodes and the feature dimension, respectively. Thus, even when accounting for the implicit layer, the overall cost of our proposed method remains comparable to that of the framelet model.
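The following sketch shows how a Chebyshev polynomial approximation replaces the eigendecomposition: the spectral filter is applied through repeated (sparse) matrix products on a rescaled Laplacian. The coefficients and the toy graph are illustrative, not the ones used by our model.

```python
import numpy as np

def chebyshev_filter(L, X, coeffs, lmax=2.0):
    """Apply sum_k c_k T_k(L_hat) to X without eigendecomposing L.

    L_hat rescales the spectrum of L into [-1, 1]; each extra polynomial term
    costs only one additional matrix product.
    """
    N = L.shape[0]
    L_hat = (2.0 / lmax) * L - np.eye(N)
    T_prev, T_curr = X, L_hat @ X                  # T_0(L_hat) X and T_1(L_hat) X
    out = coeffs[0] * T_prev + coeffs[1] * T_curr
    for c in coeffs[2:]:
        T_next = 2 * L_hat @ T_curr - T_prev       # Chebyshev recurrence
        out += c * T_next
        T_prev, T_curr = T_curr, T_next
    return out

# Toy example: normalized Laplacian of a 4-node path graph (spectrum within [0, 2]).
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt
X = np.random.default_rng(2).standard_normal((4, 3))
print(chebyshev_filter(L, X, coeffs=[0.5, 0.3, 0.2]).shape)   # (4, 3)
```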

4.3 Comparison with Other Regularizers and Potential Application Scenarios

As illustrated in Section 3.3, GNN performance can be enhanced across different learning tasks with the assistance of different forms of regularizers. In this discussion, we position our study within the wider research landscape that investigates various regularizers for improving the performance of graph neural networks. We discussed the denoising power of pL-UFG in Section 4.1, establishing that it can be expressed as a denoising formulation. This is comparable to the approach of Ma et al (2021), who used a different regularization term to highlight the denoising capabilities of GNNs and showed that their regularizer can effectively reduce noise and enhance the robustness of GNN models.

Our study, however, emphasizes the unique advantages that the p-Laplacian brings, a theme also echoed in the works by Liu et al (2010) and Candès et al (2011). Both studies incorporated the L_1-norm as a regularization term in robust principal component analysis (RPCA), showcasing its ability to recover low-rank matrices even in the presence of significant noise. Furthermore, the study by Liu et al (2021b) reinforces the benefits of the L_1-norm in preserving discontinuity and enhancing local smoothness. Characterized by its heavy-tail property, the L_1-norm imposes less penalty on large values, thereby making it effective in handling data with substantial corruption or outliers.

Furthermore, in terms of potential practical applications, our method can be deployed in various domains. For example, the p-Laplacian based regularizer can be used in image processing and computer vision, where it can help smooth out noisy or jagged images while preserving important features. In addition, since we implement our graph learning method via an optimization (regularization) framework, practical applications of GNNs such as graph generation (Ho et al, 2020) and graph-based time series forecasting (Wen et al, 2023) can also be carried out under our proposed methods with greater flexibility.

4.4 Limitation and Potential Future Studies

While our proposed model exhibits outstanding properties, how the p-Laplacian based regularization framework can be deployed on other types of graphs (e.g., dynamic or spatio-temporal graphs) remains to be explored. In fact, when the input graph is dynamic, the corresponding GNN may be required to capture the pattern of the graph's evolution (e.g., by combining the GNN with an LSTM or Transformer (Manessi et al, 2020)). We note that such a requirement is beyond the capability of the current model. Furthermore, as the p-Laplacian based regularizer restricts the solution of the graph framelet, it would be interesting to make the value of p learnable throughout the training process. We leave this to future work.

Moreover, following the idea of making p a learnable parameter, one can also explore assigning different values of p to different frequency domains of the framelet. It has been verified that the low-frequency domain of the framelet induces a smoothing effect on the graph signal, whereas a sharpening effect is produced by the high-frequency domains (Han et al, 2022). Therefore, when the input graph is highly homophilic, one may prefer a relatively large value of p in the low-frequency domain to enhance the smoothing effect of the framelet, as one prefers to predict identical labels for connected nodes. On the other hand, a smaller value of p is preferable in the high-frequency domains to further sharpen the differences between node features (i.e., ∇F) so that distinct labels can be produced when the input graph is heterophilic. Finally, as the p-Laplacian-based implicit layers are applied before the framelet reconstruction, it would be interesting to explore how such implicit layers affect the signal reconstruction of the framelet.

5 Experiments

In this section, we present empirical studies of pL-UFG and pL-fUFG on real-world node classification tasks with both heterophilic and homophilic graphs. We also test the robustness of the proposed models against noise. Both experiments are presented with detailed discussions of their results. In addition, we note that by adjusting the value of p in our proposed model, an ablation study is automatically conducted within our experiments. The code for our experiments can be accessed via https://github.com/superca729/pL-UFG. Lastly, it is worth noting that our proposed method has the potential to be applied to graph learning tasks other than node classification, such as graph-level classification (pooling) (Zheng et al, 2021b) and link prediction (Lin and Gao, 2023). Although we have yet to delve into these tasks, we believe that with some simple modifications to our method, such as adding a readout function for graph pooling or computing the log-likelihood for link prediction, our method is capable of handling them. Similarly, our model could be beneficial in community detection tasks by identifying clusters with similar characteristics or behaviors. We leave these promising research directions to future work.

5.1 Datasets, Baseline Models and the Parameter Setting

Datasets. We use both homophilic and heterophilic graphs from https://www.pyg.org/ to assess pL-UFG and pL-fUFG, including the benchmark heterophilic datasets Chameleon, Squirrel, Actor, Wisconsin, Texas and Cornell, and the homophilic datasets Cora, CiteSeer, PubMed, Computers, Photos, CS and Physics. In our experiments, homophilic graphs are undirected and heterophilic graphs are directed (we observe improved performance when direction information is provided). Summary statistics of the datasets, together with their homophily index and split ratios, are given in Table 1.
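The homophily index ℋ(𝒢) in Table 1 can be computed along the lines of the sketch below, which uses the common edge-homophily definition (fraction of edges whose endpoints share a label); the exact variant used for the table may differ.

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

# Toy example: 4 nodes with labels and 4 directed edges (source row / target row).
labels = np.array([0, 0, 1, 1])
edge_index = np.array([[0, 1, 2, 0],
                       [1, 0, 3, 2]])
print(edge_homophily(edge_index, labels))   # 0.75
```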

Baseline models. We consider nine baseline models for comparison:

  • MLP: standard feedforward multilayer perceptron.

  • GCN (Kipf and Welling, 2017): GCN is the first of its kind to implement a linear approximation to spectral graph convolutions.

  • SGC (Wu et al, 2019): SGC reduces GCNs’ complexity by removing nonlinearities and collapsing weight matrices between consecutive layers.

  • GAT (Veličković et al, 2018): GAT is a graph neural network that applies the attention mechanism to node features to learn edge weights.

  • JKNet (Xu et al, 2018b): JKNet can flexibly leverage different neighbourhood ranges to enable better structure-aware representation for each node.

  • APPNP (Gasteiger et al, 2019): APPNP combines GNN with personalized PageRank to separate the neural network from the propagation scheme.

  • GPRGNN (Chien et al, 2021): GPRGNN adaptively learns the GPR (Generalized PageRank) weights so as to jointly optimize node feature and topological information extraction, regardless of the level of homophily of a graph.

  • p-GNN (Fu et al, 2022): p-GNN is a p-Laplacian based GNN model whose message-passing mechanism is derived from a discrete regularization framework.

  • UFG (Zheng et al, 2021b): UFG is a GNN based on framelet transforms, whose framelet decomposition naturally aggregates the graph features into low-pass and high-pass spectra.

The test results are reproduced using code run on our machine and might differ from the originally reported results. In addition, compared to p-GNN, the extra time and space complexity induced by our algorithm is O(N²(RJ+1)d).

Hyperparameter tuning. We use grid search to tune hyperparameters. We test p ∈ {1.0, 1.5, 2.0, 2.5} for pGNN, pL-UFG and pL-fUFG. The learning rate is chosen from {0.01, 0.005}. We consider the number of iterations T in the p-Laplacian message passing from {4, 5} after 10 warm-up steps. For homophilic datasets we tune μ ∈ {0.1, 0.5, 1, 5, 10}, and for heterophilic graphs μ ∈ {3, 5, 10, 20, 30, 50, 70}. The framelet type is fixed as Linear (see (Yang et al, 2022)) and the level J is set to 1. The dilation scale s ∈ {1, 1.5, 2, 3, 6}, and n, the degree of the Chebyshev polynomial approximation to all g's in (15)-(18), is drawn from {2, 3, 7}. It is worth noting that in graph framelets the Chebyshev polynomial is utilized to approximate the spectral filtering of the Laplacian eigenvalues; thus a degree-d polynomial aggregates d-hop neighbourhood information for each node of the graph. Therefore, when the input graph is heterophilic, one may require d to be relatively larger, as node labels tend to differ between directly connected (1-hop) nodes. The number of epochs is set to 200, the same as for the baseline model (Fu et al, 2022).
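For reproducibility, the grid search can be organized as in the sketch below; train_and_eval is a placeholder for one full training and validation run of pL-UFG/pL-fUFG and is not part of our released code.

```python
from itertools import product

# Search grids taken from the text above (homophilic setting for mu).
grid = {
    "p":  [1.0, 1.5, 2.0, 2.5],
    "lr": [0.01, 0.005],
    "T":  [4, 5],
    "mu": [0.1, 0.5, 1, 5, 10],   # use [3, 5, 10, 20, 30, 50, 70] for heterophilic graphs
    "s":  [1, 1.5, 2, 3, 6],
    "n":  [2, 3, 7],
}

def train_and_eval(**cfg):
    # Placeholder: substitute one real training run returning validation accuracy.
    return 0.0

best_cfg, best_acc = None, -1.0
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    acc = train_and_eval(**cfg)
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc

print(best_cfg, best_acc)
```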

Table 1: Statistics of the datasets. ℋ(𝒢) denotes the level of homophily of each benchmark dataset.
Datasets Class Feature Node Edge Train/Valid/Test ℋ(𝒢)
Cora 7 1433 2708 5278 20%/10%/70% 0.825
CiteSeer 6 3703 3327 4552 20%/10%/70% 0.717
PubMed 3 500 19717 44324 20%/10%/70% 0.792
Computers 10 767 13381 245778 20%/10%/70% 0.802
Photo 8 745 7487 119043 20%/10%/70% 0.849
CS 15 6805 18333 81894 20%/10%/70% 0.832
Physics 5 8415 34493 247962 20%/10%/70% 0.915
Chameleon 5 2325 2277 31371 60%/20%/20% 0.247
Squirrel 5 2089 5201 198353 60%/20%/20% 0.216
Actor 5 932 7600 26659 60%/20%/20% 0.221
Wisconsin 5 1703 251 499 60%/20%/20% 0.150
Texas 5 1703 183 279 60%/20%/20% 0.097
Cornell 5 1703 183 277 60%/20%/20% 0.386

5.2 Experiment Results and Discussion

Node classification. Tables 2 and 3 summarize the results on homophilic and heterophilic datasets. The values after ± are standard deviations, and the top results are highlighted in bold. From the experimental results, we observe that pL-UFG and pL-fUFG achieve competitive performance against the baselines on most of the homophilic and heterophilic benchmarks. For the models' performance on homophilic undirected graphs, Table 2 shows that pL-UFG2^{1.5} has the top accuracy on Cora, Photos and CS. For CiteSeer, pL-UFG1^{2.0}, pL-UFG2^{2.0} and pL-fUFG^{2.0} have the best performance. In terms of the performance on heterophilic directed graphs (Table 3), we observe that our two models achieve the highest accuracy on all benchmark datasets except Actor, where the top performance is obtained by MLP; we note, however, that on Actor our pL-UFG1^{1.0} achieves almost identical results to MLP.

In line with our theoretical prediction in Remark 4, we find in the experiments that the value of the trade-off term μ for pL-UFG and pL-fUFG is significantly higher than that in pGNN, indicating that the framelet has a major impact on the performance of the model. The performance of pL-UFG, pL-fUFG and the baselines on heterophilic graphs is shown in Table 3: both pL-UFG and pL-fUFG outperform MLP and other state-of-the-art GNNs under a low homophily rate. In terms of denoising capacity, pL-UFG and pL-fUFG are far better than the baseline models; Figs. 6 and 7 show that both produce the top accuracies across different noise levels.

Discussion for p < 2. To better adapt to heterophilic graphs, an ideal GNN model should induce a relatively high variation in the node features it generates. Therefore, compared to the models with p = 2, the model with a p < 2 regularizer imposes a lesser penalty and thus produces outputs with a higher variation. This is supported by the empirical observation from Table 3, in which the highest prediction accuracies for heterophilic graphs are usually achieved by our model with p < 2.

Discussion for p = 1. Here we specifically focus on p = 1 in our models. Recall that when p = 1, the p-Laplacian operator acting on the graph signal F is Δ_1 F = -½ div(∇F / ‖∇F‖). Based on Remark 1, when p = 1, Δ_1 F is equivalent to the mean curvature operator defined on the embedded surface. By analogy with differential geometry, points at which the curvature cannot be properly defined are the so-called singular points. One can similarly conclude that when a graph contains a singular node (i.e., the graph is crumpled (Burda et al, 2001)), then both Δ_1 F and the corresponding S_p(F) cannot be properly defined. Furthermore, one can easily check that, when p = 1, the regularization term φ(ξ) = ξ^p produces a higher penalty than its S_p(F) counterpart. This is because, when p = 1, S_p(F) = (∑_i^c (∇f_i)²)^{1/2} whereas φ(ξ) = ξ^1 = ∑_i |∇f_i|. Clearly, S_p(F) < φ(ξ) unless all nodes of the graph are singular, or only one node is non-singular while the rest are singular.
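A quick numerical check of the last inequality, with toy feature differences for a single node across c = 3 channels:

```python
import numpy as np

# Toy entries of the gradient of f at one node, across c = 3 feature channels.
grad_f = np.array([0.5, -1.2, 0.3])

S_p = np.sqrt(np.sum(grad_f ** 2))   # S_p(F) at p = 1: Euclidean norm of the gradient
phi = np.sum(np.abs(grad_f))         # phi(xi) = xi^1: sum of absolute differences

print(S_p, phi)                      # S_p ~ 1.33 < phi = 2.0
assert S_p <= phi                    # equality only when at most one entry is nonzero
```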

5.3 Visualization of the Effect of p

Based on the claim made in the previous sections, increasing the value of p leads our model to exhibit a stronger smoothing effect on the node features, making it more suitable for homophilic graphs. Conversely, when p is small, the model preserves distinct node features, making it a better fit for heterophilic graphs. To validate this idea, we visualize the changes in the relative positions (distances) between node features for p = 1 (the minimum value) and p = 2.5 (the maximum value). We specifically selected the Cora and Actor datasets and employed Isomap (Tenenbaum et al, 2000) to map both the initial node features and the node features generated by our model to ℝ², while preserving their distances. The results are depicted in Figures 2 and 3.

From both Figures 2 and 3, it can be observed that when p = 1, the sharpening effect induced by our model causes the node features to become more distinct from each other: under the Isomap projection, the nodes are scattered apart compared to the input features. Conversely, when a relatively large value of p = 2.5 is assigned, the model exhibits a stronger smoothing effect, resulting in all nodes being aggregated towards the center of the Isomap visualization. These observations provide direct support for our main claim regarding the proposed model.
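The visualization can be reproduced along the lines of the sketch below (scikit-learn's Isomap); the two feature matrices are random placeholders for the initial features and the features produced by our model.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(3)
X_input = rng.standard_normal((200, 64))    # placeholder for the initial node features
X_model = rng.standard_normal((200, 64))    # placeholder for features produced by pL-UFG

# Distance-preserving embeddings into R^2, one fit per feature matrix.
Z_input = Isomap(n_neighbors=10, n_components=2).fit_transform(X_input)
Z_model = Isomap(n_neighbors=10, n_components=2).fit_transform(X_model)
print(Z_input.shape, Z_model.shape)         # (200, 2) (200, 2)
```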

Figure 2: Isomap visualization of the homophilic dataset (Cora) under varying p.
Figure 3: Isomap visualization of the heterophilic dataset (Actor) under varying p.
Table 2: Test accuracy (%) on homophilic undirected graphs.
Method Cora CiteSeer PubMed Computers Photos CS Physics
MLP 66.04±1.11 68.99±0.48 82.03±0.24 71.89±5.36 86.11±1.35 93.50±0.24 94.56±0.11
GCN 84.72±0.38 75.04±1.46 83.19±0.13 78.82±1.87 90.00±1.49 93.00±0.12 95.55±0.09
SGC 83.79±0.37 73.52±0.89 75.92±0.26 77.56±0.88 86.44±0.35 OOM OOM
GAT 84.37±1.13 74.80±1.00 83.92±0.28 78.68±2.09 89.63±1.75 92.57±0.14 95.13±0.15
JKNet 83.69±0.71 74.49±0.74 82.59±0.54 69.32±3.94 86.12±1.12 91.11±0.22 94.45±0.33
APPNP 83.69±0.71 75.84±0.64 80.42±0.29 73.73±2.49 87.03±0.95 91.52±0.14 94.71±0.11
GPRGNN 83.79±0.93 75.94±0.65 82.32±0.25 74.26±2.94 88.69±1.32 91.89±0.08 94.85±0.23
UFG 80.64±0.74 73.30±0.19 81.52±0.80 66.39±6.09 86.60±4.69 95.27±0.04 95.77±0.04
PGNN^{1.0} 84.21±0.91 75.38±0.82 84.34±0.33 81.22±2.62 87.64±5.05 94.88±0.12 96.15±0.12
PGNN^{1.5} 84.42±0.71 75.44±0.98 84.48±0.21 82.68±1.15 91.83±0.77 94.13±0.08 96.14±0.08
PGNN^{2.0} 84.74±0.67 75.62±1.07 84.25±0.35 83.40±0.68 91.71±0.93 94.28±0.10 96.03±0.07
PGNN^{2.5} 84.48±0.77 75.22±0.73 83.94±0.47 82.91±1.34 91.41±0.66 93.40±0.07 95.75±0.05
pL-UFG1^{1.0} 84.54±0.62 75.88±0.60 85.56±0.18 82.07±2.78 85.57±19.92 **95.03±0.22** **96.19±0.06**
pL-UFG1^{1.5} 84.96±0.38 76.04±0.85 **85.59±0.18** 85.04±1.06 **92.92±0.37** **95.03±0.22** **96.27±0.06**
pL-UFG1^{2.0} 85.20±0.42 **76.12±0.82** **85.59±0.17** **85.26±1.15** **92.65±0.65** 94.77±0.27 96.04±0.07
pL-UFG1^{2.5} 85.30±0.60 **76.11±0.82** 85.54±0.18 85.18±0.88 91.49±1.29 94.86±0.14 95.96±0.11
pL-UFG2^{1.0} 84.42±0.32 74.79±0.62 85.45±0.18 84.88±0.84 85.30±19.50 **95.03±0.19** 96.06±0.11
pL-UFG2^{1.5} **85.60±0.36** 75.61±0.60 **85.59±0.18** 84.55±1.57 **93.00±0.61** **95.03±0.19** 96.14±0.09
pL-UFG2^{2.0} 85.20±0.42 **76.12±0.82** **85.59±0.17** **85.27±1.15** 92.50±0.40 94.77±0.27 96.05±0.07
pL-UFG2^{2.5} **85.31±0.41** 75.88±0.67 85.53±0.19 85.11±0.81 92.44±1.51 **94.96±0.13** 95.99±0.12
pL-fUFG^{1.0} 84.45±0.43 75.64±0.64 **85.62±0.19** 84.57±1.08 86.09±16.35 **94.98±0.23** **96.23±0.09**
pL-fUFG^{1.5} **85.40±0.45** 75.94±0.63 **85.62±0.17** 85.00±1.21 92.12±1.35 **94.98±0.23** 96.12±0.08
pL-fUFG^{2.0} 85.20±0.42 **76.12±0.82** **85.58±0.16** **85.29±1.16** 92.47±0.48 94.77±0.27 96.05±0.07
pL-fUFG^{2.5} 85.21±0.44 76.01±0.97 85.54±0.20 84.94±0.91 92.26±1.26 94.95±0.11 96.02±0.04
Table 3: Test accuracy (%) on heterophilic directed graphs.
Method Chameleon Squirrel Actor Wisconsin Texas Cornell
MLP 48.82±1.43 34.30±1.13 **41.66±0.83** 93.45±2.09 71.25±12.99 83.33±4.55
GCN 33.71±2.27 26.19±1.34 33.46±1.42 67.90±8.16 53.44±11.23 55.68±10.57
SGC 33.83±1.69 26.89±0.98 32.08±2.22 59.56±11.19 64.38±7.53 43.18±16.41
GAT 41.95±2.65 25.66±1.72 33.64±3.45 60.65±11.08 50.63±28.36 34.09±29.15
JKNet 33.50±3.46 26.95±1.29 31.14±3.63 60.42±8.70 63.75±5.38 45.45±9.99
APPNP 34.61±3.15 32.61±0.93 39.11±1.11 82.41±2.17 80.00±5.38 60.98±13.44
GPRGNN 34.23±4.09 34.01±0.82 34.63±0.58 86.11±1.31 84.38±11.20 66.29±11.20
UFG 50.11±1.67 31.48±2.05 40.13±1.11 93.52±2.36 84.69±4.87 83.71±3.28
PGNN^{1.0} 49.04±1.16 34.79±1.01 40.91±1.41 94.35±2.16 82.00±11.31 82.73±6.92
PGNN^{1.5} 49.12±1.14 34.86±1.25 40.87±1.47 94.72±1.91 81.50±10.70 81.97±10.16
PGNN^{2.0} 49.34±1.15 34.97±1.41 40.83±1.81 94.44±1.75 84.38±11.52 81.06±10.18
PGNN^{2.5} 49.16±1.40 34.94±1.57 40.78±1.51 94.35±2.16 83.38±12.95 81.82±8.86
pL-UFG1^{1.0} **56.81±1.69** 38.81±1.97 **41.26±1.66** **96.48±0.94** 86.13±7.47 86.06±3.16
pL-UFG1^{1.5} **56.89±1.17** **39.73±1.22** 40.95±0.93 **96.48±1.07** 87.00±5.16 **86.52±2.29**
pL-UFG1^{2.0} 56.24±1.02 39.72±1.86 40.95±0.93 **96.59±0.72** 86.50±8.84 85.30±2.35
pL-UFG1^{2.5} 56.11±1.25 39.38±1.78 41.04±0.99 95.34±1.64 **89.00±4.99** 83.94±3.53
pL-UFG2^{1.0} 55.51±1.53 36.94±5.69 29.28±19.25 93.98±2.94 85.00±5.27 **87.73±2.49**
pL-UFG2^{1.5} **57.22±1.19** **39.80±1.42** 40.89±0.75 **96.48±0.94** 87.63±5.32 86.82±1.67
pL-UFG2^{2.0} 56.19±0.99 **39.74±1.66** 41.01±0.80 96.14±1.16 86.50±8.84 85.30±2.35
pL-UFG2^{2.5} 55.69±1.15 39.30±1.68 40.86±0.74 95.80±1.44 86.38±2.98 84.55±3.31
pL-fUFG^{1.0} 55.80±1.93 38.43±1.26 32.84±16.54 93.98±3.47 86.25±6.89 **87.27±2.27**
pL-fUFG^{1.5} 55.65±1.96 38.40±1.52 41.00±0.99 **96.48±1.29** 87.25±3.61 86.21±2.19
pL-fUFG^{2.0} 55.95±1.29 38.33±1.71 **41.25±0.84** **96.25±1.25** **88.75±4.97** 83.94±3.78
pL-fUFG^{2.5} 55.56±1.66 38.39±1.48 40.55±0.50 95.28±2.24 **88.50±7.37** 83.64±3.88

5.4 Experiments on Denoising Capacity

Following the discussion in Section 4.1, in this section we evaluate the denoising capacity of our proposed models (pL-UFG and pL-fUFG) on both a homophilic (Cora) and a heterophilic (Chameleon) graph. Since the node features of the included datasets are binary, we randomly inject binary noise into different proportions of the node features (i.e., r ∈ {5%, 10%, 15%, 20%}). From Figs. 6 and 7 we see that pL-UFG2^{1.5}, defined in Eq. (16), outperforms the other baselines and shows the strongest robustness to the contaminated datasets. This is expected based on our discussion of the denoising power in Section 4.1 and the formulation of our models (defined in Eqs. (15), (16) and (17)). The denoising capacity of our proposed model stems from two sources: the sparse approximation of the framelet decomposition and reconstruction, i.e., 𝒲 and 𝒲^T, and the variational regularizer S_p(F) (Dong, 2017). Compared to pL-UFG1 defined in Eq. (15), pL-UFG2 applies the denoising operators 𝒲, 𝒲^T and S_p(F) to node inputs that are reconstructed from both the sparsely decomposed low- and high-frequency domains, so that the denoising operators target the original graph inputs at different scales, naturally resulting in better denoising performance. In addition, without the reconstruction performed first, the term ‖F_{r,j} - diag(θ_{r,j})𝒲_{r,j}X‖²_F in pL-fUFG is less sparse than its pL-UFG2 counterpart. Thus, when regularized with the same S_p(F), the insufficiently eliminated noise in pL-fUFG leads it to a solution space that is incorrect (with respect to pL-UFG2) and cannot be recovered by the additional reconstruction with 𝒲^T; this is the likely reason why pL-UFG2 outperforms our other proposed models.
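One natural reading of the noise model above is to flip a random fraction r of the binary feature entries, as in the sketch below; the exact corruption scheme used in our experiments may differ in detail.

```python
import numpy as np

def add_binary_noise(X, r, seed=0):
    """Flip a random fraction r of the entries of a binary feature matrix X."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < r        # entries selected for corruption
    return np.where(mask, 1 - X, X)

X = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0]])
for r in (0.05, 0.10, 0.15, 0.20):
    X_noisy = add_binary_noise(X, r, seed=42)
    print(r, float(np.mean(X_noisy != X)))   # realized corruption rate
```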

In addition to the above experiment, we show how the value of p affects the denoising power of the proposed model while keeping all other parameters constant. The resulting changes in denoising power are shown in Fig. 4 (homophilic graph) and Fig. 5 (heterophilic graph).

Figure 4: Change of the denoising power with the value of p on a homophilic graph.
Figure 5: Change of the denoising power with the value of p on a heterophilic graph.

Based on these results, the performance for different values of p varies with the noise rate. For the homophilic graph, pL-UFG2^{1.5} and pL-UFG2^{2.0} are relatively stable across noise rates, while the other values of p fluctuate more. For the heterophilic graph, pL-UFG1^{2.5} and pL-UFG2^{2.5} are relatively stable across noise rates, while the other values of p fluctuate more.

We also note that these fluctuations are expected, as we have shown that the denoising process of our proposed model can be written as:

\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\sum_{(r,\ell)\in\mathcal{Z}}\|\nabla\mathbf{F}_{(r,\ell)}\|_{p}^{p}+\mu\|\mathbf{F}-\mathcal{W}^{T}\textrm{diag}(\theta)\mathcal{W}\mathbf{X}\|^{2}_{F}, \qquad (34)

where 𝒵 = {(r,ℓ) : r = 1,…,R, ℓ = 1,…,J} ∪ {(0,J)} is the index set of all framelet-decomposed frequency domains. It is not hard to verify that once a larger value of p is assigned, the penalty on the node feature differences (∇F) becomes greater, and therefore a stronger denoising power is induced. In terms of graph types, when the input graph is heterophilic, the connected nodes tend to have distinct features in most cases; after noise is added to those features, a larger value of p is preferred if the feature differences become larger. Meanwhile, adding noise to a heterophilic graph could also make the feature differences smaller, in which case a large value of p may not be appropriate; this explains why most of the lines in Fig. 5 fluctuate strongly. Similar reasoning applies to the denoising performance of our model on homophilic graphs, which we omit here.

5.5 Regarding the Ablation Study

It is worth noting that one advantage of an adjustable p-Laplacian-based regularizer is the convenience of conducting an ablation study. The key principle of an ablation study is to test whether a new component of a proposed method (in our case, the regularizer) consistently adds advantages over the baseline counterparts, regardless of the newly introduced parameters. This means that the ablation study of our method is naturally conducted by comparing our models (the regularizers together with the underlying networks) against the baselines on both the node classification tasks and the denoising tests. It is not hard to observe that in most cases, regardless of the change of p, our proposed model keeps outperforming the baseline models. However, as our proposed model is derived from a regularization framework, another parameter that could affect the model performance is the value of μ. Based on Eq. (22), an increase in μ increases the model's convexity and could consequently guide the model closer to the global optimum when p ≠ 2. Another observation regarding μ is that a smaller value of μ together with a relatively larger value of p appears more suitable for homophilic/heterophilic graphs, implying that μ seems to have an effect on the model's adaptation power opposite to that of p. However, quantifying the effect of μ via a suitable measure (e.g., potentially the Dirichlet energy (Han et al, 2022)) is beyond the scope of this paper, and we leave it to future discussions.

Figure 6: Denoising power on the homophilic graph (Cora).
Figure 7: Denoising power on the heterophilic graph (Chameleon).

6 Conclusion and Further Work

This paper showcases the application of p-Laplacian based regularization in conjunction with graph framelets. The incorporation of p-Laplacian regularization brings remarkable flexibility, enabling effective adaptation to both homophilic undirected and heterophilic directed graphs and thereby significantly enhancing the predictive capability of the model. To validate the efficacy of our proposed model, we conducted extensive numerical experiments on diverse graph datasets, demonstrating its superiority over baseline methods. Notably, our model exhibits robustness against noise perturbation, even under high noise levels. These promising findings highlight the potential of our approach and warrant further investigation in several directions. For instance, delving into the mathematical properties of our model, including weak and strong convergence, analyzing the behavior of the (Dirichlet) energy, and establishing connections with non-linear diffusion equations, opens up fascinating avenues for future research.

References

  • Ahmedt-Aristizabal et al (2021) Ahmedt-Aristizabal D, Armin MA, Denman S, Fookes C, Petersson L (2021) Graph-based deep learning for medical diagnosis and analysis: past, present and future. Sensors 21(14):4758
  • Assis et al (2021) Assis AD, Torres LC, Araújo LR, Hanriot VM, Braga AP (2021) Neural networks regularization with graph-based local resampling. IEEE Access 9:50727–50737
  • Belkin et al (2004) Belkin M, Matveeva I, Niyogi P (2004) Tikhonov regularization and semi-supervised learning on large graphs. In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, vol 3, pp iii–1000
  • Bozorgnia et al (2019) Bozorgnia F, Mohammadi SA, Vejchodskỳ T (2019) The first eigenvalue and eigenfunction of a nonlinear elliptic system. Applied Numerical Mathematics 145:159–174
  • Bronstein et al (2017) Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4):18–42
  • Bruna et al (2014) Bruna J, Zaremba W, Szlam A, LeCun Y (2014) Spectral networks and locally connected networks on graphs. In: Proceedings of International Conference on Learning Representations
  • Burda et al (2001) Burda Z, Correia J, Krzywicki A (2001) Statistical ensemble of scale-free random graphs. Physical Review E 64(4):046118
  • Candès et al (2011) Candès EJ, Li X, Ma Y, Wright J (2011) Robust principal component analysis? Journal of the ACM (JACM) 58(3):1–37
  • Chen et al (2022a) Chen J, Wang Y, Bodnar C, Liò P, Wang YG (2022a) Dirichlet energy enhancement of graph neural networks by framelet augmentation. https://yuguangwang.github.io/papers/EEConv.pdf
  • Chen et al (2022b) Chen Q, Wang Y, Wang Y, Yang J, Lin Z (2022b) Optimization-induced graph implicit nonlinear diffusion. In: Proceedings of the 39th International Conference on Machine Learning
  • Cheng and Wang (2020) Cheng T, Wang B (2020) Graph and total variation regularized low-rank representation for hyperspectral anomaly detection. IEEE Transactions on Geoscience and Remote Sensing 58(1):391–406, DOI 10.1109/TGRS.2019.2936609
  • Chien et al (2021) Chien E, Peng J, Li P, Milenkovic O (2021) Adaptive universal generalized pagerank graph neural network. In: Proceedings of International Conference on Learning Representations
  • Ciotti et al (2016) Ciotti V, Bonaventura M, Nicosia V, Panzarasa P, Latora V (2016) Homophily and missing links in citation networks. EPJ Data Science 5:1–14
  • Defferrard et al (2016) Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems 29
  • Dong (2017) Dong B (2017) Sparse representation on graphs by tight wavelet frames and applications. Applied and Computational Harmonic Analysis 42(3):452–479, DOI 10.1016/j.acha.2015.09.005
  • Fan et al (2019) Fan W, Ma Y, Li Q, He Y, Zhao E, Tang J, Yin D (2019) Graph neural networks for social recommendation. In: Proceedings of WWW, pp 417–426
  • Frecon et al (2022) Frecon J, Gasso G, Pontil M, Salzo S (2022) Bregman neural networks. In: International Conference on Machine Learning, PMLR, pp 6779–6792
  • Fu et al (2022) Fu G, Zhao P, Bian Y (2022) pp-Laplacian based graph neural networks. In: Proceedings of the 39th International Conference on Machine Learning, PMLR, vol 162, pp 6878–6917
  • Gasteiger et al (2019) Gasteiger J, Bojchevski A, Günnemann S (2019) Predict then propagate: Graph neural networks meet personalized pagerank. In: Proceedings of International Conference on Learning Representations
  • Gilmer et al (2017) Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: International Conference on Machine Learning, PMLR, pp 1263–1272
  • Gu et al (2020) Gu F, Chang H, Zhu W, Sojoudi S, El Ghaoui L (2020) Implicit graph neural networks. In: Advances in Neural Information Processing Systems
  • Hammond et al (2011) Hammond DK, Vandergheynst P, Gribonval R (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30(2):129–150
  • Han et al (2022) Han A, Shi D, Shao Z, Gao J (2022) Generalized energy and gradient flow via graph framelets. arXiv:2210.04124
  • He et al (2021) He M, Wei Z, Xu H, et al (2021) Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems 34:14239–14251
  • He and Kempe (2015) He X, Kempe D (2015) Stability of influence maximization. In: Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining, p 1256–1265
  • Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33:6840–6851
  • Hu et al (2018) Hu X, Sun Y, Gao J, Hu Y, Yin B (2018) Locality preserving projection based on F-norm. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp 1330–1337
  • Kipf and Welling (2017) Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of International Conference on Learning Representations
  • Leskovec and Faloutsos (2006) Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, p 631–636, DOI 10.1145/1150402.1150479
  • Li et al (2022) Li J, Lin S, Blanchet J, Nguyen VA (2022) Tikhonov regularization is optimal transport robust under martingale constraints. arXiv:2210.01413
  • Lin and Gao (2023) Lin L, Gao J (2023) A magnetic framelet-based convolutional neural network for directed graphs. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1–5, DOI 10.1109/ICASSP49357.2023.10097148
  • Liu et al (2022) Liu A, Li B, Li T, Zhou P, Wang R (2022) An-gcn: An anonymous graph convolutional network against edge-perturbing attacks. IEEE Transactions on Neural Networks and Learning Systems pp 1–15, DOI 10.1109/TNNLS.2022.3172296
  • Liu et al (2010) Liu G, Lin Z, Yu Y (2010) Robust subspace segmentation by low-rank representation. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp 663–670
  • Liu et al (2021a) Liu J, Kawaguchi K, Hooi B, Wang Y, Xiao X (2021a) Eignn: Efficient infinite-depth graph neural networks. In: Advances in Neural Information Processing Systems
  • Liu et al (2021b) Liu X, Jin W, Ma Y, Li Y, Liu H, Wang Y, Yan M, Tang J (2021b) Elastic graph neural networks. In: International Conference on Machine Learning, PMLR, pp 6837–6849
  • Luo et al (2010) Luo D, Huang H, Ding CHQ, Nie F (2010) On the eigenvectors of pp-Laplacian. Machine Learning 81(1):37–51
  • Ly (2005) Ly I (2005) The first eigenvalue for the p-laplacian operator. JIPAM J Inequal Pure Appl Math 6:91
  • Ma et al (2021) Ma Y, Liu X, Zhao T, Liu Y, Tang J, Shah N (2021) A unified view on graph neural networks as graph signal denoising. URL https://openreview.net/forum?id=MD3D5UbTcb1
  • Manessi et al (2020) Manessi F, Rozza A, Manzo M (2020) Dynamic graph convolutional networks. Pattern Recognition 97:107000
  • Oka and Yamada (2023) Oka T, Yamada T (2023) Topology optimization method with nonlinear diffusion. Computer Methods in Applied Mechanics and Engineering 408:115940, DOI https://doi.org/10.1016/j.cma.2023.115940, URL https://www.sciencedirect.com/science/article/pii/S0045782523000634
  • Ortega et al (2018) Ortega A, Frossard P, Kovacević J, Moura JMF, Vandergheynst P (2018) Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE 106(5):808–828
  • Pandit et al (2007) Pandit S, Chau DH, Wang S, Faloutsos C (2007) Netprobe: a fast and scalable system for fraud detection in online auction networks. In: Proceedings of the 16th international conference on World Wide Web, pp 201–210
  • Park et al (2021) Park J, Choo J, Park J (2021) Convergent graph solvers. In: Proceedings of International Conference on Learning Representations
  • Scarselli et al (2009) Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80
  • Sweldens (1998) Sweldens W (1998) The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis 29(2):511–546, DOI 10.1137/S0036141095289051
  • Tenenbaum et al (2000) Tenenbaum JB, Silva Vd, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. science 290(5500):2319–2323
  • Veličković et al (2018) Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2018) Graph attention networks. In: Proceedings of International Conference on Learning Representations
  • Wen et al (2023) Wen H, Lin Y, Xia Y, Wan H, Zimmermann R, Liang Y (2023) Diffstg: Probabilistic spatio-temporal graph forecasting with denoising diffusion models. arXiv:2301.13629
  • Wu et al (2019) Wu F, Zhang T, Souza AHd, Fifty C, Yu T, Weinberger KQ (2019) Simplifying graph convolutional networks. In: Proceedings of International Conference on Machine Learning
  • Wu et al (2021) Wu J, Sun J, Sun H, Sun G (2021) Performance analysis of graph neural network frameworks. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp 118–127, DOI 10.1109/ISPASS51385.2021.00029
  • Wu et al (2022) Wu S, Sun F, Zhang W, Xie X, Cui B (2022) Graph neural networks in recommender systems: a survey. ACM Computing Surveys to appear
  • Wu et al (2020) Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32(1):4–24
  • Xu et al (2018a) Xu K, Hu W, Leskovec J, Jegelka S (2018a) How powerful are graph neural networks? In: Proceedings of International Conference on Learning Representations
  • Xu et al (2018b) Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi Ki, Jegelka S (2018b) Representation learning on graphs with jumping knowledge networks. In: Proceedings of International Conference on Machine Learning
  • Yang et al (2022) Yang M, Zheng X, Yin J, Gao J (2022) Quasi-Framelets: Another improvement to graph neural networks. arXiv:2201.04728
  • Zhang et al (2003) Zhang Q, Gu Y, Mateusz M, Baktashmotlagh M, Eriksson A (2003) Implicitly defined layers in neural networks. arXiv:2003.01822
  • Zhang et al (2022) Zhang Z, Cui P, Zhu W (2022) Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering 34(1):249–270, DOI 10.1109/TKDE.2020.2981333
  • Zheng et al (2021a) Zheng X, Liu Y, Pan S, Zhang M, Jin D, Yu PS (2021a) Graph neural networks for graphs with heterophily: A survey. In: Proceedings of the AAAI Conference on Artificial Intelligence
  • Zheng et al (2021b) Zheng X, Zhou B, Gao J, Wang YG, Lio P, Li M, Montufar G (2021b) How framelets enhance graph neural networks. In: Proceedings of International Conference on Machine Learning
  • Zheng et al (2022) Zheng X, Zhou B, Wang YG, Zhuang X (2022) Decimated framelet system on graphs and fast g-framelet transforms. Journal of Machine Learning Research 23:18–1
  • Zhou et al (2021a) Zhou B, Li R, Zheng X, Wang YG, Gao J (2021a) Graph denoising with framelet regularizer. arXiv:2111.03264
  • Zhou et al (2021b) Zhou B, Liu X, Liu Y, Huang Y, Lio P, Wang Y (2021b) Spectral transform forms scalable transformer. arXiv:2111.07602
  • Zhou and Schölkopf (2005) Zhou D, Schölkopf B (2005) Regularization on discrete spaces. In: The 27th DAGM Symposium, August 31 – September 2, 2005, Vienna, Austria, vol 3663, pp 361–368
  • Zhou et al (2020) Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M (2020) Graph neural networks: A review of methods and applications. AI Open 1:57–81
  • Zhu et al (2020a) Zhu J, Yan Y, Zhao L, Heimann M, Akoglu L, Koutra D (2020a) Beyond homophily in graph neural networks: Current limitations and effective designs. In: Advances in Neural Information Processing Systems, vol 33, pp 7793–7804
  • Zhu et al (2021) Zhu M, Wang X, Shi C, Ji H, Cui P (2021) Interpreting and unifying graph neural networks with an optimization framework. In: Proceedings of WWW
  • Zhu et al (2020b) Zhu S, Pan S, Zhou C, Wu J, Cao Y, Wang B (2020b) Graph geometry interaction learning. Advances in Neural Information Processing Systems 33:7548–7558
  • Zou et al (2022) Zou C, Han A, Lin L, Gao J (2022) A simple yet effective SVD-GCN for directed graphs. arXiv:2205.09335