
The Devil is in the Conflict: Disentangled Information Graph Neural Networks For Fraud Detection

Zhixun Li1,†, Dingshuo Chen2,3,†, Qiang Liu2,3, Shu Wu2,3,*
†The first two authors contributed equally to this work. *Corresponding author.
1School of Computer Science and Technology, Beijing Institute of Technology
2Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences
3School of Artificial Intelligence, University of Chinese Academy of Sciences
[email protected], [email protected], {qiang.liu, shu.wu}@nlpr.ia.ac.cn
Abstract

Graph-based fraud detection has received considerable attention. Owing to the great success of Graph Neural Networks (GNNs), many approaches adopting GNNs for fraud detection have been gaining momentum. However, most existing methods are based on the strong inductive bias of homophily, which assumes that neighboring nodes tend to have the same labels or similar features. In real scenarios, fraudsters often engage in camouflage behaviors to avoid the detection system. Therefore, the homophily assumption no longer holds, which is known as the inconsistency problem. In this paper, we argue that the performance degradation is mainly attributed to the inconsistency between topology and attribute. To address this problem, we propose to disentangle the fraud network into two views, corresponding to topology and attribute respectively. We then propose a simple and effective method that uses an attention mechanism to adaptively fuse the two views and capture data-specific preference. In addition, we further improve it by introducing mutual information constraints for topology and attribute. To this end, we propose a Disentangled Information Graph Neural Network (DIGNN) model, which utilizes variational bounds to find an approximate solution to our proposed optimization objective. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art baselines on real-world fraud detection datasets.

Index Terms:
Graph Neural Networks, Fraud Detection, Information Theory

I Introduction

Graph-based fraud detection is a crucial task with tremendous impact in various applications, such as opinion fraud detection [1], fake news detection [2, 3], review spam [4], and financial fraud detection [5, 6]. In these scenarios, since graphs can effectively model the correlations among entities, interactive activities on a platform can be characterized as a graph, where users or objects are treated as nodes, and transactions or relations between them are treated as edges.

Numerous techniques have been proposed to detect fraudsters. Recently, driven by the powerful representation capability of graph structure and the advances of Graph Neural Networks (GNNs) [7, 8, 9], many approaches try to harness GNNs for fraud detection on either homogeneous or heterogeneous graphs. The main idea is to leverage GNNs to learn expressive node representations with the goal of distinguishing abnormal nodes from normal ones in the latent embedding space. Message-Passing GNNs (MP-GNNs) have been the mainstream in recent years; they aggregate neighbor node features and achieve local smoothing by stacking layers. Although MP-GNNs obtain satisfactory performance in most cases, the strong inductive bias of homophily limits their representative ability on heterophilic graphs. Some works [10] point out that many GNNs can be seen as low-pass filters, so their generalization ability on high-frequency graph signals is poor. In the fraud detection task, fraudsters often imitate normal users in order to camouflage themselves, and hence interact with normal users more frequently. For instance, normal users account for 81% of the fraudsters' neighbor nodes in the YelpChi dataset (Figure 1). In other words, fraudsters' features are inconsistent with their behaviors (interactions, i.e., topological structure). Since MP-GNNs do not work well on heterophilic graphs, they fail to tackle this inconsistency phenomenon in graph-based fraud detection, and fraudsters can fool the detection system.

Recently, a few works have noticed this problem; they employ aggregation weights to reduce the adverse impact of dissimilar neighbors, or set similarity-aware thresholds to select and re-link similar nodes. For instance, GraphConsis [11] computes a consistency score between connected node pairs as the sampling probability. PC-GNN [6] combines label information and latent embeddings into a distance function to measure similarity. Although such methods can alleviate the inconsistency problem to some extent, they discard a lot of information while filtering out dissimilar neighbors, which may lead to sub-optimal performance.

In this paper, we analyze the inconsistency problem in the graph-based fraud detection task, which has been obstructing a full understanding of this field. First, we clarify that the inconsistency problem is the bottleneck of graph fraud detection. According to [12], the underlying optimization process of GNNs is equivalent to minimizing topology and attribute constraints, and Yang et al. [13] indicate that the degradation of performance stems from the compromise between topology and attribute. Because the camouflage behaviors (topology) of fraudsters are inconsistent with their essence (attribute), this conflict in fraud networks can impair the discriminative ability of GNNs. Second, the dominant view differs across datasets, and most existing methods are not satisfactory in fusing topological structures and node attributes [14]. For example, fraudsters may possess distinguishable attributes on some platforms, but their deceptive behaviors can still confuse the detection model. Therefore, we are motivated to explore a novel method that minimizes the conflict between topology and attribute while effectively extracting the most task-relevant information from the data.

We bring the concept of multi-view learning to the graph-based fraud detection task and propose a simple and effective model, Disentangled Information Graph Neural Networks (DIGNN). Technically, we first disentangle the fraud network into topology and attribute views. Next, we employ an attention mechanism to fuse the two view embeddings adaptively and extract task-relevant information. Surprisingly, we observe that this simple method surpasses all state-of-the-art baselines, which empirically supports that the conflict between topology and attribute causes the inconsistency problem. Besides, to further decrease the entanglement between topology and attribute and improve performance, we design a new optimization objective based on information theory, which resorts to variational bounds to minimize the mutual information between the two views and maximize the mutual information between the view embeddings and the original inputs.

We conduct extensive experiments to compare our proposed model with existing graph-based fraud detection models, and the results demonstrate the effectiveness of our model. In summary, the contributions of this paper can be summarized as follows:

  • We analyze the cause of the inconsistency problem and point out that it is mainly attributed to the conflict between topology and attribute. In light of this, we propose a simple yet effective model, DIGNN, which first disentangles the fraud network into two views and fuses them with an attention mechanism.

  • We propose a novel optimization objective based on mutual information theory and theoretically derive its upper bound for tractable calculation.

  • We verify the effectiveness of our model on real-world fraud detection datasets. It is shown that our model is able to significantly improve the performance in terms of all commonly adopted metrics.

Figure 1: (a) Illustration of graph-based fraud detection. (b) Neighbor distribution of fraudsters and benign users in YelpChi dataset.

II Related Work

II-A Graph-based Fraud Detection

The core idea of the graph-based fraud detection task is to take advantage of GNNs to obtain discriminative node embeddings and find the malicious ones in the latent space. Examples include [11, 15, 16] for review fraud detection, [2, 3] for fake news detection, and [5, 6, 17, 18, 19] for financial fraud detection. Ma et al. [20] provide a comprehensive survey of graph-based fraud detection.

Most existing GNN methods hold the homophily assumption that neighboring nodes share the same labels or similar features. However, fraudsters try to conceal themselves, so their features are inconsistent with their camouflage behaviors. Some graph-based fraud detection works have noticed this problem. GraphConsis [11] is the first to formulate and tackle the inconsistency problem, introducing three kinds of inconsistency phenomena in fraud networks. CARE-GNN [15] devises a label-aware similarity measure to find informative neighboring nodes and utilizes reinforcement learning to select similar neighbors. FRAUDRE [21] aggregates the differences between adjacent node pairs. PC-GNN [6] devises a choose operation to select beneficial neighbors based on feature similarity. IHGAT [22] encodes both sequence-like intentions and the relationships among transactions to leverage cross-interaction information.

Our model is different from all of the above works: instead of measuring similarity between adjacent node pairs, we disentangle topology and attribute and treat graph learning as a multi-view learning problem.

II-B Multi-view on GNNs

Topology and attribute are two essential components of graphs. However, existing state-of-the-art GNN models are unable to effectively fuse topological structure and node attributes. AM-GCN [14] uses k-nearest neighbors to construct a feature graph and combines it with the topological structure view and common embeddings. SCRL [23] designs a self-supervised approach to maximize the agreement of the embeddings in the topology graph and the feature graph. A recent work [13] claims that the interference between topology and attribute is mainly ascribed to compromises between them. LINKX [24] processes node attributes and topological structure in an orthogonal manner. In this paper, we follow this idea and extend it by proposing a novel architecture and optimization objective.

Information-theoretic methods have been gaining momentum in recent years; they take into consideration the mutual dependency of different views. MIB [25] extends the information bottleneck principle to the unsupervised multi-view setting to discard superfluous information. DVIB [26] and CMIB [27] leverage mutual information constraints to better preserve the shared and private information in multi-view learning. To cope with the intractable computation of mutual information, these methods adopt variational inference to optimize lower and upper bounds of the objective. In comparison, our model proposes a novel optimization objective to reconcile consistency and complementarity between the topology and attribute views. Equipped with variational inference, we also approximate the mutual information with derived bounds.

III PRELIMINARIES

III-A Problem Statement

Definition 1. Graph-based Fraud Detection. Given a fraud network $\mathcal{G}=(\mathcal{V},\mathbf{A},\mathbf{X})$, where $\mathcal{V}=\{v_{1},v_{2},\ldots,v_{N}\}$ is the set of nodes; $\mathbf{A}\in\mathbb{R}^{N\times N}$ is the adjacency matrix, with $\mathbf{A}_{ij}=1$ if $v_{i}$ and $v_{j}$ are connected and $\mathbf{A}_{ij}=0$ otherwise; and $\mathbf{X}\in\mathbb{R}^{N\times D}$ denotes the node feature matrix, where each node $v_{i}$ is associated with a $D$-dimensional feature vector $\mathbf{x}_{i}$ and a label $y_{i}\in\{0,1\}$, with 0 denoting a normal user (negative) and 1 a fraudster (positive). The goal of graph-based fraud detection is to learn discriminative node embeddings to detect the anomalous samples in the latent space.

Definition 2. Graph Neural Networks. Most GNNs follow the message passing mechanism, which uses feature transformation and aggregation operations to capture structural and attribute information. One of the most popular and representative GNN models is the graph convolutional network (GCN). The forward inference at the $l$-th layer of a GCN is formally defined as:

\mathbf{H}^{(l)}=\sigma(\hat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{W}^{(l)}), (1)

where $\sigma(\cdot)$ is a nonlinear activation function, $\mathbf{W}^{(l)}\in\mathbb{R}^{d\times d}$ is the linear transformation matrix, $\mathbf{H}^{(l)}$ denotes the node embedding matrix at the $l$-th layer with $\mathbf{H}^{(0)}=\mathbf{X}$, and $\hat{\mathbf{A}}$ is the normalized adjacency matrix, which can be implemented by row normalization, $\hat{\mathbf{A}}=\mathbf{D}^{-1}(\mathbf{A}+\mathbf{I})$, or symmetric normalization, $\hat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{D}$ is the diagonal degree matrix and $\mathbf{I}$ is the identity matrix.
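For concreteness, a GCN layer of Eq. (1) can be written in a few lines of PyTorch. The sketch below is our own minimal rendering (not the implementation used in the experiments) and assumes a dense adjacency matrix, symmetric normalization, and a ReLU nonlinearity:

```python
import torch
import torch.nn as nn

def normalize_adj(a):
    """Symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    a = a + torch.eye(a.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    """One GCN layer implementing Eq. (1): H^(l) = sigma(A_hat H^(l-1) W^(l))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, h):
        # a_hat: normalized adjacency (N x N); h: node embeddings (N x in_dim)
        return torch.relu(self.linear(a_hat @ h))
```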

Interestingly, some works [12] have shown that representative GNN models can be regarded as solving a graph signal denoising problem: given a noisy signal $\mathbf{S}\in\mathbb{R}^{N\times d_{in}}$ on graph $\mathcal{G}$, the goal is to recover a clean signal $\mathbf{F}\in\mathbb{R}^{N\times d_{out}}$:

\underset{\mathbf{F}}{\arg\min}\ \mathcal{L}=\lVert\mathbf{F}-\mathbf{S}\rVert_{F}^{2}+c\cdot\mathrm{tr}(\mathbf{F}^{\top}\mathbf{L}\mathbf{F}), (2)

where the first term encourages the output signal $\mathbf{F}$ to stay close to the original signal $\mathbf{S}$ and the second term encourages signal smoothness on the graph.
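To make the connection concrete, Eq. (2) can be solved numerically by gradient descent on $\mathbf{F}$. The following toy sketch (our own illustration, assuming the noisy signal S and graph Laplacian L are given as dense tensors) is not part of the proposed model:

```python
import torch

def denoise_graph_signal(S, L, c=1.0, lr=0.1, steps=200):
    """Recover a clean signal F from noisy signal S by gradient descent on Eq. (2)."""
    F = S.clone().requires_grad_(True)
    opt = torch.optim.Adam([F], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((F - S) ** 2).sum() + c * torch.trace(F.t() @ L @ F)
        loss.backward()
        opt.step()
    return F.detach()
```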

Figure 2: The overview of our proposed DIGNN. The attributed fraud network is disentangled into topological structure and node attributes. DIGNN processes these two views in parallel and fuses them by attention mechanism. In addition, in order to further reduce the entanglement between two views, DIGNN minimizes the mutual information between topology embeddings and attribute embeddings, and maximizes the mutual information between embeddings and input data respectively.

IV METHOD

In this section, we present our model, the Disentangled Information Graph Neural Network (DIGNN). Figure 2 gives an overview. It consists of three main objectives: 1) disentangle the attributed fraud network into topology and attribute views and fuse them with an attention mechanism, training the final embeddings with a cross-entropy loss; 2) to further reduce the conflict between the two views, minimize the mutual information between them; 3) to maintain the semantic information of the input space, maximize the mutual information between the view-specific embeddings and their original inputs.

IV-A View-specific Embedding

It is universally acknowledged that topology and attribute are of vital importance for graph learning. However, in the graph fraud detection scenario, traditional message passing along neighboring nodes is inappropriate, since graph signal smoothing makes fraudsters more indistinguishable. To alleviate the inconsistency problem, we disentangle the topology and attribute information and encode them in parallel.

Given an attributed fraud network $\mathcal{G}$, it can be disentangled into a topology view $\mathbf{A}$ and an attribute view $\mathbf{X}$. We provide a view encoder for each input view, $f_{A}$ and $f_{X}$, as shown in Figure 2. Specifically, we employ Multi-Layer Perceptrons (MLPs) as encoders to obtain the view-specific embeddings $\mathbf{Z}^{A},\mathbf{Z}^{X}\in\mathbb{R}^{N\times d}$:

\mathbf{Z}^{A}=f_{A}(\mathbf{A}),\ \mathbf{Z}^{X}=f_{X}(\mathbf{X}), (3)

where $d$ is the embedding dimension. With these two embeddings, we need to fuse them to obtain the final representation and extract task-relevant information.
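A minimal sketch of such a view encoder is given below. The two-layer depth and the 32-dimensional hidden size follow the experimental settings in Section V, while the class and variable names are our own:

```python
import torch.nn as nn

class MLPEncoder(nn.Module):
    """Two-layer MLP used as a view-specific encoder f_A or f_X (Eq. 3)."""
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)

# The topology encoder takes a node's adjacency row, the attribute encoder its feature vector:
# f_A = MLPEncoder(in_dim=N), f_X = MLPEncoder(in_dim=D)
```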

IV-B Cross-view Fusion

Given the two view-specific embeddings $\mathbf{Z}^{A}$ and $\mathbf{Z}^{X}$, we perform cross-view fusion with an attention mechanism. The attention value $\omega_{i}^{j}$ is computed as:

\omega_{i}^{j}=q\cdot\tanh(\mathbf{W}\cdot(\mathbf{z}_{i}^{j})^{\top}+b),\ j\in\{A,X\} (4)

where $q$ denotes the learnable attention vector, $\mathbf{W}$ is the weight matrix, and $b$ is the bias vector. Thus, we obtain the attention values $\omega^{A}_{i}$ and $\omega^{X}_{i}$ for the view-specific embeddings $\mathbf{z}^{A}_{i}$ and $\mathbf{z}^{X}_{i}$, respectively. We then normalize them via the softmax function to get the final weights:

\alpha_{i}^{j}=\text{softmax}(\omega_{i}^{j})=\frac{\exp(\omega_{i}^{j})}{\exp(\omega_{i}^{A})+\exp(\omega_{i}^{X})},\ j\in\{A,X\} (5)

A larger attention weight $\alpha_{i}^{j}$ implies that the corresponding embedding is more important, and it is determined by the specific dataset. The final output embedding $\mathbf{z}_{i}$ is then the combination of the two view-specific embeddings weighted by their attention weights:

\mathbf{z}_{i}=\alpha_{i}^{A}\cdot\mathbf{z}_{i}^{A}+\alpha_{i}^{X}\cdot\mathbf{z}_{i}^{X}. (6)

We then feed it into a linear classifier and train with a cross-entropy loss function:

\mathcal{L}_{ce}=\sum_{v_{i}\in\mathcal{V}_{\text{train}}}-\log(y_{i}\cdot\sigma(\mathbf{W}^{\prime}\cdot\mathbf{z}_{i}+b^{\prime})) (7)

where $\mathbf{W}^{\prime}$ and $b^{\prime}$ are the weight matrix and bias vector of the linear classifier, $\sigma$ is the softmax function, and $\mathcal{V}_{\text{train}}$ is the training node set.
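The fusion and classification steps of Eqs. (4)-(7) can be sketched as a single PyTorch module. This is our own minimal rendering of the equations (e.g., PyTorch's built-in cross-entropy is used in place of Eq. (7)), not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Fuses the two view embeddings via Eqs. (4)-(6) and classifies them (Eq. 7)."""
    def __init__(self, dim, num_classes=2):
        super().__init__()
        self.W = nn.Linear(dim, dim)              # shared projection W with bias b
        self.q = nn.Parameter(torch.randn(dim))   # learnable attention vector q
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, z_a, z_x):
        # omega_i^j = q . tanh(W z_i^j + b), one scalar per node per view (Eq. 4)
        w_a = torch.tanh(self.W(z_a)) @ self.q
        w_x = torch.tanh(self.W(z_x)) @ self.q
        alpha = F.softmax(torch.stack([w_a, w_x], dim=1), dim=1)   # Eq. (5)
        z = alpha[:, 0:1] * z_a + alpha[:, 1:2] * z_x              # Eq. (6)
        return self.classifier(z), z, alpha

# Training loss on the labeled nodes, in the spirit of Eq. (7):
# loss_ce = F.cross_entropy(logits[train_mask], labels[train_mask])
```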

IV-C Mutual Information Optimization

So far, we have discussed how to obtain view-specific embeddings and fuse them with an attention mechanism. However, as mentioned in [13], representative GNN models tend to lose expressive power due to interference between attribute and topology. Despite the decoupling operation, it is impractical to expect the model to learn mutually exclusive views simply by equipping it with an attention mechanism. In other words, we need principles to guide the training procedure. By leveraging information theory, we propose a novel optimization objective to alleviate the aforementioned problem. Furthermore, we derive variational bounds of our optimization objective and discuss their intrinsic effect and intuitive insight. Without loss of generality, we let $\mathbf{X}_{1},\mathbf{X}_{2}$ denote the original views and $\mathbf{Z}_{1},\mathbf{Z}_{2}$ the view-specific embeddings for ease of reading.

IV-C1 Optimization principles

The first principle aims to induce the model to learn mutually exclusive embeddings, which ameliorates the compromise between attribute and topology. Since mutual information measures the mutual dependence of variables, we introduce the constraint term $\min I(\mathbf{Z}_{1},\mathbf{Z}_{2})$ into our optimization objective. In this way, the model is able to reduce redundancy and better exploit sufficient semantic information in an embedding space of limited dimensionality.

Nevertheless, the mutual-exclusion constraint is prone to impair helpful shared information. For instance, in the Amazon dataset, the handcrafted features are highly correlated with the social network (topology), so a mutual-exclusion constraint alone would injure the attribute semantics during training. The second principle therefore builds the relationship between the view-specific embeddings and their original inputs. Given the rich but distinct semantics inherent in attribute and topology, it is necessary to extract useful features while maintaining the respective information from the input data space. We thus introduce the constraint term $\max I(\mathbf{Z}_{i},\mathbf{X}_{i})$ into our optimization objective so that the encoders retain as much view-specific information as possible. To sum up, our mutual information optimization objective is as follows:

\min\ I(\mathbf{Z}_{1},\mathbf{Z}_{2})-\sum_{i=1}^{2}I(\mathbf{Z}_{i},\mathbf{X}_{i}) (8)

IV-C2 Theoretical Analysis

Recall that the optimization objective has the form $\min I(\mathbf{Z}_{1},\mathbf{Z}_{2})-\sum_{i=1}^{2}I(\mathbf{Z}_{i},\mathbf{X}_{i})$. However, it is intractable to directly calculate mutual information for high-dimensional variables [28]. We therefore resort to deriving variational bounds of the mutual information to find an approximate solution to the original objective. In this work, $\mathbf{X},\mathbf{Y},\mathbf{Z}$ denote random variables, and $x,y,z$ denote the corresponding instances.

Lower bound of $I(\mathbf{Z},\mathbf{X})$. For conciseness, we use $I(\mathbf{Z},\mathbf{X})$ to represent $I(\mathbf{Z}_{i},\mathbf{X}_{i})$ ($i\in\{1,2\}$). According to the definition of mutual information, we have

I(\mathbf{Z},\mathbf{X})=\mathbb{E}_{p(x,z)}\log\frac{p(x|z)}{p(x)} (9)
\geq\mathbb{E}_{p(x,z)}\log q(x|z)+H(\mathbf{X}) (10)

where $H(\mathbf{X})$ is the entropy of $\mathbf{X}$ and $q(x|z)$ is the variational approximation of the conditional distribution $p(x|z)$. Notice that the entropy of $\mathbf{X}$ is independent of the optimization procedure. Therefore, maximizing $I(\mathbf{Z},\mathbf{X})$ is equivalent to maximizing $\mathbb{E}_{p(x,z)}\log q(x|z)$.

Figure 3: A demonstration of original optimization objective (left) and lower bound target (right).

Upper bound of $I(\mathbf{Z}_{1},\mathbf{Z}_{2})$. The overall optimization transformation is illustrated in Figure 3. We first introduce some basic properties from information theory [29]:

I(\cdot;\cdot)\geq 0 (11)
I(\mathbf{X}\mathbf{Y};\mathbf{Z})=I(\mathbf{Y};\mathbf{Z})+I(\mathbf{X};\mathbf{Z}|\mathbf{Y}) (12)

Following [25], we have

I(\mathbf{Z}_{1},\mathbf{Z}_{2})\leq I(\mathbf{Z}_{1},\mathbf{Z}_{2})+I(\mathbf{Z}_{1};\mathbf{X}_{2}|\mathbf{Z}_{2})
=I(\mathbf{Z}_{1};\mathbf{Z}_{2}\mathbf{X}_{2})
=I(\mathbf{Z}_{1};\mathbf{Z}_{2}\mathbf{X}_{2})-I(\mathbf{Z}_{1};\mathbf{Z}_{2}|\mathbf{X}_{2})
=I(\mathbf{Z}_{1},\mathbf{X}_{2})

Here we assume that $\mathbf{Z}_{2}$ is a sufficient representation of $\mathbf{X}_{2}$, which means $I(\mathbf{Z}_{1};\mathbf{Z}_{2}|\mathbf{X}_{2})=0$ holds. By the symmetry of mutual information, we can follow the same derivation and obtain the analogous bound for $I(\mathbf{Z}_{2},\mathbf{Z}_{1})$:

I(\mathbf{Z}_{2},\mathbf{Z}_{1})\leq I(\mathbf{Z}_{2},\mathbf{X}_{1}) (13)

We then take the mutual information $I(\mathbf{Z}_{1},\mathbf{X}_{2})$ as an example and derive its upper bound. We have:

I(\mathbf{Z}_{1},\mathbf{X}_{2})=\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{1}}(z_{1}|x_{2})}{p(z_{1})} (14)
=\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{1}}(z_{1}|x_{2})}{r(z_{1})}\frac{r(z_{1})}{p(z_{1})} (15)
=\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{1}}(z_{1}|x_{2})}{r(z_{1})}-KL\left[r(z_{1})\|p(z_{1})\right] (16)
\leq\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{1}}(z_{1}|x_{2})}{r(z_{1})} (17)
=\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{1}}(z_{1}|x_{2})}{p_{x_{2}}(z_{2}|x_{2})} (18)
\quad+\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{2}}(z_{2}|x_{2})}{r(z_{1})} (19)

where $p_{x_{1}}$ and $p_{x_{2}}$ represent the encoders that encode information from the original feature spaces. The upper bound becomes tighter as the marginal distribution $r(z_{1})$ approaches the prior $p(z_{1})$. Looking at the two terms of the upper bound, the first measures the difference between the two latent representations produced by $p_{x_{1}}$ and $p_{x_{2}}$ from the same input $x_{2}$, while the second measures the difference between the encoder $p_{x_{2}}$ and the approximated marginal $r(z_{1})$. According to [26], the two terms have the same optimization direction, so we simplify the upper bound to

I(\mathbf{Z}_{1},\mathbf{X}_{2})\leq\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{2}}(z_{2}|x_{2})}{r(z_{1})} (20)

Again, taking into account the symmetry of mutual information, we average the two forms to obtain the final upper bound:

I(\mathbf{Z}_{1},\mathbf{Z}_{2})\leq\frac{1}{2}\left[\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{2}}(z_{2}|x_{2})}{r(z_{1})}+\mathbb{E}_{p(z_{2},x_{1})}\log\frac{p_{x_{1}}(z_{1}|x_{1})}{r(z_{2})}\right] (21)

In practice, we minimize the reconstruction loss, which is equivalent to maximizing the lower bound of $I(\mathbf{Z},\mathbf{X})$, as done in auto-encoder models [30]. Depending on the type of input $x$, $q(x|z)$ can be any appropriate distribution with a known probability density function. Here we let $q(x|z)$ be the Gaussian distribution $\mathcal{N}(x;\mu(z),\sigma^{2}I)$ with a given variance $\sigma^{2}$ and a deterministic mean function $\mu(\cdot)$ parameterized by neural networks, so we have

\mathcal{L}_{rec}=-\sum_{i=1}^{2}\mathbb{E}_{p(x_{i},z_{i})}\log q(x_{i}|z_{i}) (22)
\propto\sum_{i=1}^{2}\mathbb{E}_{p(x_{i},z_{i})}\left[\left\|x_{i}-\mu_{i}(z_{i})\right\|_{2}^{2}\right] (23)
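In code, maximizing the lower bound therefore amounts to a per-view reconstruction loss. A minimal sketch, with a decoder architecture and names of our own choosing, is:

```python
import torch.nn as nn
import torch.nn.functional as F

class ViewDecoder(nn.Module):
    """Deterministic mean function mu_i(.) reconstructing a view from its embedding."""
    def __init__(self, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, z):
        return self.net(z)

def reconstruction_loss(x_a, x_x, z_a, z_x, dec_a, dec_x):
    # Eq. (23): squared reconstruction error, summed over the two views
    return F.mse_loss(dec_a(z_a), x_a) + F.mse_loss(dec_x(z_x), x_x)
```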

To minimize the upper bound of $I(\mathbf{Z}_{1},\mathbf{Z}_{2})$, we define $p_{x_{i}}(z_{i}|x_{i})$ and $r(z_{i})$ as the Gaussian distributions $\mathcal{N}(z_{i};\mu(x_{i}),\sigma_{i}^{2}I)$ and $\mathcal{N}(z_{i};\mu_{i},\sigma_{i}^{2}I)$ respectively, where $\mu_{i}$ and $\sigma_{i}$ are a given expectation and variance. We then employ the reparameterization trick [31] to rewrite $p_{x_{i}}(z_{i}|x_{i})=p(\epsilon_{i})$, where $z_{i}=\mu(x_{i})+\epsilon_{i}\sigma$, $\epsilon_{i}\sim\mathcal{N}(0,I)$, and our mutual-exclusion loss $\mathcal{L}_{exc}$ is defined as

\mathcal{L}_{exc}=\frac{1}{2}\left[\mathbb{E}_{p(z_{1},x_{2})}\log\frac{p_{x_{2}}(z_{2}|x_{2})}{r(z_{1})}+\mathbb{E}_{p(z_{2},x_{1})}\log\frac{p_{x_{1}}(z_{1}|x_{1})}{r(z_{2})}\right] (24)
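Under the Gaussian assumptions above, the exclusive loss of Eq. (24) can be estimated with Monte-Carlo samples drawn via the reparameterization trick. The sketch below is our own literal reading of Eq. (24) with diagonal Gaussians (function and argument names are ours; the released implementation may estimate the bound differently):

```python
import math
import torch

def reparameterize(mu, logvar):
    """z = mu + eps * sigma with eps ~ N(0, I) (reparameterization trick [31])."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def gaussian_log_density(z, mu, logvar):
    """log N(z; mu, diag(exp(logvar))), summed over the embedding dimensions."""
    return -0.5 * (((z - mu) ** 2) / logvar.exp() + logvar + math.log(2.0 * math.pi)).sum(dim=-1)

def exclusive_loss(z1, mu1, logvar1, z2, mu2, logvar2, prior_mu, prior_logvar):
    """Monte-Carlo estimate of Eq. (24): each z_i is scored under its own encoder
    density, while the other view's sample is penalised against the fixed Gaussian
    marginal r(z)."""
    term_1 = gaussian_log_density(z2, mu2, logvar2) - gaussian_log_density(z1, prior_mu, prior_logvar)
    term_2 = gaussian_log_density(z1, mu1, logvar1) - gaussian_log_density(z2, prior_mu, prior_logvar)
    return 0.5 * (term_1 + term_2).mean()
```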

Eventually, the overall optimization objective is formulated as follows:

\mathcal{L}=\mathcal{L}_{ce}+\alpha\cdot\mathcal{L}_{rec}+\beta\cdot\mathcal{L}_{exc} (25)

where $\alpha$ and $\beta$ are scalar factors. Moreover, it is worth noting that the reconstruction term is equivalent to graph signal denoising without the smoothness term, which is reasonable given the inconsistency problem in graph anomaly detection. Intuitively, our loss function denoises the original graph signal and enforces mutual exclusion between attribute and topology together with the supervised information. The training procedure is presented in Algorithm 1.

Input: $\mathcal{G}=(\mathcal{V},\mathbf{A},\mathbf{X})$: a fraud network; $\mathcal{V}_{\text{train}}$: set of training nodes; $N_{\text{epoch}}$: number of training epochs; $N_{\text{batch}}$: training batch size; $d$: dimension of hidden embeddings; $\alpha$, $\beta$: loss balance factors.
Output: The vector representations of each node in $\mathcal{V}$
1   Initialization;
2   for $e=1,\ldots,N_{\text{epoch}}$ do
3       Calculate the number of training batches $B=\lceil|\mathcal{V}_{\text{train}}|/N_{\text{batch}}\rceil$;
4       Down-sample negative samples according to the number of positive samples;
5       for $b=1,\ldots,B$ do
6           Gather the nodes of batch $b$ along with the edges between them to construct the sub-graph $\mathcal{G}_{b}=(\mathcal{V}_{b},\mathbf{A}_{b},\mathbf{X}_{b})$;
7           $\mathbf{Z}^{A},\mathbf{Z}^{X}\leftarrow$ Eq. 3;
8           $\mathbf{Z}\leftarrow$ fuse the two view-specific embeddings $\mathbf{Z}^{A}$ and $\mathbf{Z}^{X}$ w.r.t. Eq. 6;
9           $\mathcal{L}_{ce}\leftarrow$ Eq. 7;   // Cross-Entropy Loss
10          $\mathcal{L}_{rec}\leftarrow$ Eq. 22;   // Reconstruction Loss
11          $\mathcal{L}_{exc}\leftarrow$ Eq. 24;   // Mutual-exclusion Loss
12          $\mathcal{L}\leftarrow$ compute the final loss w.r.t. Eq. 25;
13          Back-propagate and update parameters;
14      end for
15  end for
Algorithm 1 DIGNN: Disentangled Information Graph Neural Network
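The inner loop of Algorithm 1 then reduces to a standard training step that combines the three losses. The sketch below reuses the components sketched earlier in this section; all names are illustrative, `exc_fn` stands in for the mutual-exclusion estimator, and the batching and down-sampling details are omitted:

```python
import torch
import torch.nn.functional as F

def train_step(batch, f_a, f_x, fusion, dec_a, dec_x, exc_fn, optimizer, alpha=0.5, beta=0.5):
    """One inner-loop step of Algorithm 1 on a sub-graph batch (sketch)."""
    a_b, x_b, y_b, train_mask = batch                                     # adjacency rows, features, labels, mask
    z_a, z_x = f_a(a_b), f_x(x_b)                                         # view-specific embeddings, Eq. (3)
    logits, z, _ = fusion(z_a, z_x)                                       # attention fusion + classifier, Eqs. (4)-(6)
    loss_ce = F.cross_entropy(logits[train_mask], y_b[train_mask])        # Eq. (7)
    loss_rec = F.mse_loss(dec_a(z_a), a_b) + F.mse_loss(dec_x(z_x), x_b)  # Eq. (23)
    loss_exc = exc_fn(z_a, z_x)                                           # upper bound of I(Z^A, Z^X), Eq. (24)
    loss = loss_ce + alpha * loss_rec + beta * loss_exc                   # Eq. (25)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```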

V Experiments

In this section, we investigate the effectiveness of DIGNN, and aim to answer the following research questions:

  • RQ1: Does the disentangling operation help alleviate the inconsistency problem?

  • RQ2: Does DIGNN outperform the state-of-the-art methods for graph-based fraud detection?

  • RQ3: How do the different components of DIGNN contribute to the performance improvement in the graph-based fraud detection task?

  • RQ4: What is the performance of DIGNN with respect to different hyperparameters?

V-A Experiment Setup

V-A1 Datasets.

Our proposed DIGNN model is evaluated on two real-world opinion fraud network datasets: YelpChi and Amazon. The YelpChi dataset [32] collects hotel and restaurant reviews from the Yelp.com online platform. The nodes of the YelpChi dataset are reviews with 32 handcrafted features, and the dataset includes three relations: 1) R-U-R, which connects reviews posted by the same user; 2) R-S-R, which connects reviews of the same product with the same star rating; 3) R-T-R, which connects two reviews of the same product posted in the same month. The Amazon dataset [33] includes product reviews in the Musical Instruments category. The nodes in the Amazon graph are users with 25 handcrafted features, and the dataset also contains three relations: 1) U-P-U, which connects users reviewing at least one common product; 2) U-S-U, which connects users giving at least one identical star rating within one week; 3) U-V-U, which connects users with top-5% mutual review text similarities (measured by TF-IDF) among all users. The statistics of the datasets are shown in Table I.

V-A2 Baselines.

We compare with several representative state-of-the-art models to verify the effectiveness of DIGNN in graph-based fraud detection.

  • GCN [7]: a graph convolutional network that generates node embeddings by aggregating features in the neighborhood.

  • GAT [9]: a graph attention network that aggregates neighbors with different weights calculated by an attention mechanism.

  • GraphSAGE [8]: an inductive GNN model that samples neighbors by connection information and aggregates features by stacking layers.

  • DR-GCN [34]: a dual-regularized graph convolutional network to handle multi-class imbalanced graph representation learning.

  • CARE-GNN [15]: a fraud detection GNN model that utilizes a similarity measure to enhance aggregation and reinforcement learning to obtain the optimal selection count.

  • FRAUDRE [21]: a GNN model that aggregates the differences between neighbors and tackles class imbalance.

  • PC-GNN [6]: a state-of-the-art graph-based fraud detection method, which proposes pick and choose operations to alleviate inconsistency and class imbalance.

  • DIGNN∖S: a variant of DIGNN trained directly on the node features and adjacency matrix, without mini-batch sampling.

  • DIGNN∖M: a variant of DIGNN with the mutual information optimization removed.

TABLE I: The Statistics of the Datasets.

Dataset | #Nodes (fraud%)  | Relation: #Edges                                                    | Class: #Nodes
YelpChi | 45,954 (14.53%)  | R-U-R: 49,315; R-T-R: 573,616; R-S-R: 3,402,743; ALL: 3,846,979     | Fraudster: 6,677; Benign: 39,277
Amazon  | 11,944 (6.87%)   | U-P-U: 175,608; U-S-U: 3,566,479; U-V-U: 1,036,737; ALL: 4,398,392  | Fraudster: 821; Benign: 7,818; Unlabeled: 3,305
TABLE II: Performance Comparison on YelpChi and Amazon.

                        |                    YelpChi                        |                    Amazon
Method                  | F1-macro       | AUC            | GMean           | F1-macro       | AUC            | GMean
Baselines
  GCN                   | 0.4929±0.0025  | 0.6274±0.0034  | 0.1886±0.0063   | 0.5461±0.0363  | 0.8328±0.0111  | 0.2570±0.0789
  GAT                   | 0.4879±0.0230  | 0.5715±0.0029  | 0.1659±0.0789   | 0.6464±0.0387  | 0.8102±0.0179  | 0.6675±0.1345
  GraphSAGE             | 0.4405±0.1066  | 0.5439±0.0025  | 0.2589±0.1864   | 0.6416±0.0079  | 0.7589±0.0046  | 0.5949±0.0349
  DR-GCN                | 0.5523±0.0231  | 0.5921±0.0195  | 0.4038±0.0742   | 0.6488±0.0364  | 0.8295±0.0079  | 0.5357±0.1077
  CARE-GNN              | 0.6075±0.0128  | 0.7713±0.0015  | 0.7023±0.0044   | 0.8875±0.0040  | 0.9398±0.0032  | 0.8848±0.0012
  FRAUDRE               | 0.5841±0.0365  | 0.7427±0.0084  | 0.6654±0.0210   | 0.8806±0.0320  | 0.9272±0.0021  | 0.8808±0.0049
  PC-GNN                | 0.6130±0.0083  | 0.7715±0.0005  | 0.7068±0.0015   | 0.8557±0.0227  | 0.9482±0.0034  | 0.8952±0.0044
Ablation
  DIGNN∖S               | 0.5120±0.0027  | 0.6120±0.0067  | 0.5895±0.0010   | 0.7308±0.0064  | 0.8913±0.0020  | 0.8088±0.0042
  DIGNN∖M               | 0.6994±0.0149  | 0.8389±0.0128  | 0.7348±0.0173   | 0.9186±0.0029  | 0.9645±0.0019  | 0.9195±0.0013
Ours
  DIGNN                 | 0.7092±0.0025  | 0.8526±0.0067  | 0.7596±0.0105   | 0.9189±0.0045  | 0.9729±0.0039  | 0.9281±0.0038
Figure 4: The attention changing trends w.r.t epochs.

V-A3 Settings.

The parameters of DIGNN are optimized with the Adam [35] optimizer. The learning rate is set to 0.001, the weight decay to 0.0005, the number of training epochs to 50, the hidden dimension of the node embeddings to 32, the scales of the reconstruction learning rate for both topology and attribute to 0.05, and the number of encoder layers to 2. The train, validation, and test ratios are set to 40%, 20%, and 40%, respectively. We use Scikit-learn [36] to implement the train-test split, and the class imbalance ratio is kept consistent across the three sets. It is worth noting that, to alleviate the influence of class imbalance, we employ down-sampling when training DIGNN.

GCN, GAT, and GraphSAGE suffer from the class imbalance and inconsistency problems and tend to always predict normal (negative) samples. Therefore, we follow PC-GNN and utilize the threshold-moving strategy, setting the classification threshold to 0.2 for both YelpChi and Amazon. For CARE-GNN, FRAUDRE, and PC-GNN, we use the parameters introduced by the authors.
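As a simple illustration of threshold-moving (the variable names are hypothetical), fraud is predicted whenever the positive-class probability exceeds the moved threshold of 0.2 rather than the default 0.5:

```python
import numpy as np

# fraud_prob: predicted positive-class probabilities for the test nodes (hypothetical values)
fraud_prob = np.array([0.15, 0.35, 0.05, 0.60])
y_pred = (fraud_prob >= 0.2).astype(int)  # threshold moved from 0.5 to 0.2
```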

V-A4 Implementation.

Our DIGNN model is implemented in PyTorch 1.7.0 [37]. DIGNN, GCN, GAT, and GraphSAGE are all implemented with PyTorch Geometric 2.0.3 [38]. For CARE-GNN, FRAUDRE, and PC-GNN, we run the source code provided by the authors. All models are run with Python 3.8.12 on one NVIDIA GeForce RTX 2080 GPU and a 3.20 GHz Intel Xeon E5-2660 CPU.

V-A5 Metrics.

The fraud detection datasets exhibit a skewed class distribution, so accuracy is not suitable for evaluating the effectiveness of fraud detection models. The evaluation metrics should have no bias toward any class. Therefore, we use three common metrics: F1-macro, AUC, and GMean. F1-macro is the unweighted mean of the F1-scores of the two classes. AUC is the area under the ROC curve:

\text{AUC}=\frac{\sum_{u\in\mathcal{U}^{+}}\mathrm{rank}_{u}-\frac{|\mathcal{U}^{+}|\times(|\mathcal{U}^{+}|+1)}{2}}{|\mathcal{U}^{+}|\times|\mathcal{U}^{-}|}

Here, $\mathcal{U}^{+}$ and $\mathcal{U}^{-}$ denote the minority and majority class sets in the testing set, respectively, and $\mathrm{rank}_{u}$ indicates the rank of node $u$ according to its prediction score. GMean is the geometric mean of the True Positive Rate (TPR) and the True Negative Rate (TNR), defined as

\text{GMean}=\sqrt{\text{TPR}\cdot\text{TNR}}=\sqrt{\frac{\text{TP}}{\text{TP}+\text{FN}}\cdot\frac{\text{TN}}{\text{TN}+\text{FP}}}.

Higher scores on these three metrics indicate better performance.
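These three metrics can be computed with Scikit-learn. A minimal sketch (assuming binary labels and predicted positive-class probabilities; the decision threshold is a free choice) is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute F1-macro, AUC, and GMean as reported in Table II (sketch)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    f1_macro = f1_score(y_true, y_pred, average="macro")
    auc = roc_auc_score(y_true, y_prob)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
    return f1_macro, auc, gmean
```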

Figure 5: Sensitivity analysis with respect to different training ratio on YelpChi dataset. The solid line represents the average score of 3 runs and the shadow indicates the standard deviation.
Figure 6: Analysis of the hidden dimension $d$.
Figure 7: AUC as the two hyperparameters $\alpha$ and $\beta$ vary from 0 to 1.
Figure 8: Visualization of the learned node embeddings on YelpChi dataset.

V-B Analysis of Attention Mechanism (RQ1)

To answer RQ1, we analyze the attention values and visualize them to investigate whether the attention values learned by our model are meaningful. The attention trends are shown in Figure 4, where the x-axis is the number of training epochs and the y-axis is the average attention value. As training progresses, the difference between the attention values of topology and attribute becomes striking. We observe that DIGNN pays more attention to the attribute view on YelpChi and to the topology view on Amazon. This demonstrates that our model has a strong capability to extract the task-relevant information from the two views.

V-C Performance Comparison (RQ2)

To answer RQ2, we compare the performance of DIGNN with state-of-the-art methods. The corresponding F1-macro, AUC, and GMean scores are shown in Table II, from which we make the following two observations.

First, DIGNN significantly outperforms the other SOTA baselines on all metrics for both YelpChi and Amazon. On the YelpChi dataset, our model obtains improvements of 9.62%, 8.11%, and 5.28% in F1-macro, AUC, and GMean, respectively. We observe that PC-GNN outperforms the other baselines on most metrics, but our model still surpasses it by a significant margin. On the Amazon dataset, graph-based fraud detection methods have already achieved high performance and the room for improvement is limited, yet our model still obtains gains of 3.14% in F1-macro, 2.47% in AUC, and 3.29% in GMean. In addition, the relatively low standard deviation of DIGNN shows that our model is stable.

Second, the compared baselines can be divided into two groups: traditional MP-GNNs and graph-based fraud detection methods. GCN, GAT, and GraphSAGE are traditional GNN models, and DR-GCN is designed for imbalanced node classes. They do not consider the inconsistency problem, so these models perform poorly on the YelpChi and Amazon datasets. Because the Amazon dataset has a more skewed label distribution (fraudsters account for only 6.87% of all samples), more intra-class edges, which benefit traditional MP-GNNs, appear in the graph, so these methods achieve relatively satisfactory performance on Amazon. CARE-GNN and PC-GNN are graph-based fraud detection methods that sample neighbors according to a similarity measure, which alleviates the inconsistency problem to a certain degree; therefore, they perform better on these two datasets. However, the neighbor-sampling strategy may discard a lot of information and lead to sub-optimal results. Instead, our DIGNN model abandons this practice, disentangles the original graph into topological structure and node attributes, and processes them in parallel.

In general, DIGNN outperforms all baselines in F1-macro, AUC, and GMean on the YelpChi and Amazon datasets, which demonstrates the effectiveness of our model.

V-D Ablation Study (RQ3)

To answer RQ3, we compare DIGNN with its two variants, DIGNN∖S and DIGNN∖M. The results on the two datasets are shown in Table II. We observe that DIGNN surpasses its variants on most metrics. The overall performance of DIGNN∖M on YelpChi and Amazon is inferior to the complete model, which verifies the effectiveness of our proposed mutual information objective. It is worth noting that DIGNN∖M is on par with DIGNN on the Amazon dataset in terms of F1-macro. This could be attributed to the smaller fraud rate of Amazon: without the guidance of mutual information, DIGNN∖M pays more attention to the majority class than DIGNN does. For DIGNN∖S, we observe that DIGNN is evidently better than the model without the sampling strategy. We suppose this is caused by the noisy information of the structure view; the sampling strategy plays a denoising role on the structural information to some extent.

V-E Sensitivity Analysis (RQ4)

To answer RQ4, we further evaluate the performance of DIGNN with respect to the training ratio, the hidden dimension $d$, and the hyperparameters $\alpha,\beta$. For the training ratio, we vary the percentage of training nodes from 10% to 50% and compare DIGNN with two other baselines, CARE-GNN and PC-GNN. Figure 5 shows the F1-macro, AUC, and GMean on the YelpChi dataset. DIGNN always achieves the best performance among the three models. When the training ratio is 10%, DIGNN still performs better than PC-GNN trained on 50% of the samples, and DIGNN surpasses CARE-GNN and PC-GNN by a large margin in AUC.

For the hidden dimension $d$, we study the performance of DIGNN with $d$ varying from 8 to 64; the results are presented in Figure 6. As the hidden dimension $d$ increases, the performance improves at first and then grows slowly; overall, DIGNN is relatively stable with respect to $d$.

For the hyperparameters $\alpha$ and $\beta$, we vary them from 0 to 1, and the corresponding results are shown in Figure 7. Due to limited space, we only present the AUC on the YelpChi and Amazon datasets. The optimal choice of the two hyperparameters varies greatly across datasets: on YelpChi, higher AUC is achieved with a larger $\beta$ ($\beta\geq 0.8$), while on Amazon, a larger $\alpha$ ($\alpha\geq 0.6$) and a smaller $\beta$ ($\beta\leq 0.4$) give better results.

V-F Visualization

To show the effectiveness of different models more intuitively, we visualize the learned node embeddings on the YelpChi dataset. Specifically, we compare our DIGNN model with three other models: one traditional MP-GNN, GCN, and two graph-based fraud detection models, CARE-GNN and PC-GNN. We first take the 32-dimensional output embedding of the last layer of these models before the softmax function, and then employ t-SNE [39] to map the 32-dimensional embeddings into a 2-dimensional space for visualization. Because of the imbalanced class distribution, we randomly sample the same number of benign samples as fraudster samples for better visibility. The results on YelpChi are shown in Figure 8, where orange dots represent fraudsters and blue dots represent benign entities.
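A minimal sketch of this visualization procedure (our own rendering with scikit-learn and matplotlib; the paper's exact plotting code is not given) is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, labels, seed=0):
    """t-SNE visualization of node embeddings with a class-balanced sample, as in Figure 8."""
    rng = np.random.default_rng(seed)
    frauds = np.where(labels == 1)[0]
    benign = rng.choice(np.where(labels == 0)[0], size=len(frauds), replace=False)
    idx = np.concatenate([frauds, benign])
    xy = TSNE(n_components=2, random_state=seed).fit_transform(embeddings[idx])
    colors = ["tab:orange" if y == 1 else "tab:blue" for y in labels[idx]]
    plt.scatter(xy[:, 0], xy[:, 1], c=colors, s=5)
    plt.show()
```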

We observe that, due to the strong inductive bias of homophily, GCN is unable to learn discriminative node embeddings. CARE-GNN and PC-GNN alleviate the inconsistency problem, but still fail to separate the embeddings of fraudsters from those of benign entities. Conversely, DIGNN achieves obvious inter-class separation, and the overlap between the two kinds of nodes is relatively small. This verifies the effectiveness of our proposed DIGNN model.

VI Conclusion

In this paper, we suggest that the disentangling operation is beneficial for alleviating the inconsistency problem in fraud networks. To decrease the conflict between topological structure and node attributes, we propose a simple yet effective model named DIGNN. It first disentangles the attributed fraud network into two views, topology and attribute. Then DIGNN fuses the two kinds of view information adaptively with an attention mechanism, which can effectively extract task-relevant information. Moreover, we design a novel optimization objective to further reduce the entanglement between the two view-specific embeddings while maintaining their semantic information. Experimental results demonstrate that DIGNN outperforms state-of-the-art methods on two real-world graph fraud detection datasets.

VII Acknowledgement

This work is jointly sponsored by National Natural Science Foundation of China (U19B2038, 62141608, 62206291) and CCF-AFSG Research Fund (20210001).

References

  • [1] A. Li, Z. Qin, R. Liu, Y. Yang, and D. Li, “Spam review detection with graph convolutional networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2703–2711.
  • [2] Y. Dou, K. Shu, C. Xia, P. S. Yu, and L. Sun, “User preference-aware fake news detection,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2051–2055.
  • [3] W. Xu, J. Wu, Q. Liu, S. Wu, and L. Wang, “Mining fine-grained semantics via graph neural networks for evidence-based fake news detection,” arXiv preprint arXiv:2201.06885, 2022.
  • [4] L. Deng, C. Wu, D. Lian, Y. Wu, and E. Chen, “Markov-driven graph convolutional networks for social spammer detection,” IEEE Transactions on Knowledge and Data Engineering, 2022.
  • [5] D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, and Y. Qi, “A semi-supervised graph attentive network for financial fraud detection,” in 2019 IEEE International Conference on Data Mining (ICDM).   IEEE, 2019, pp. 598–607.
  • [6] Y. Liu, X. Ao, Z. Qin, J. Chi, J. Feng, H. Yang, and Q. He, “Pick and choose: a gnn-based imbalanced learning approach for fraud detection,” in Proceedings of the Web Conference 2021, 2021, pp. 3168–3177.
  • [7] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [8] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • [9] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” stat, vol. 1050, p. 20, 2017.
  • [10] H. Nt and T. Maehara, “Revisiting graph neural networks: All we have is low-pass filters,” arXiv preprint arXiv:1905.09550, 2019.
  • [11] Z. Liu, Y. Dou, P. S. Yu, Y. Deng, and H. Peng, “Alleviating the inconsistency problem of applying graph neural network to fraud detection,” in Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp. 1569–1572.
  • [12] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah, “A unified view on graph neural networks as graph signal denoising,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 1202–1211.
  • [13] L. Yang, W. Zhou, W. Peng, B. Niu, J. Gu, C. Wang, X. Cao, and D. He, “Graph neural networks beyond compromise between attribute and topology,” 2022.
  • [14] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “Am-gcn: Adaptive multi-channel graph convolutional networks,” in Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining, 2020, pp. 1243–1253.
  • [15] Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, and P. S. Yu, “Enhancing graph neural network-based fraud detectors against camouflaged fraudsters,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 315–324.
  • [16] Y. Wang, J. Zhang, S. Guo, H. Yin, C. Li, and H. Chen, “Decoupling representation learning and classification for gnn-based anomaly detection,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1239–1248.
  • [17] X. Ao, Y. Liu, Z. Qin, Y. Sun, and Q. He, “Temporal high-order proximity aware behavior analysis on ethereum,” World Wide Web, vol. 24, no. 5, pp. 1565–1585, 2021.
  • [18] T. Liang, G. Zeng, Q. Zhong, J. Chi, J. Feng, X. Ao, and J. Tang, “Credit risk and limits forecasting in e-commerce consumer lending service via multi-view-aware mixture-of-experts nets,” in Proceedings of the 14th ACM international conference on web search and data mining, 2021, pp. 229–237.
  • [19] G. Zhang, Z. Li, J. Huang, J. Wu, C. Zhou, J. Yang, and J. Gao, “efraudcom: An e-commerce fraud detection system via competitive graph neural networks,” ACM Transactions on Information Systems (TOIS), vol. 40, no. 3, pp. 1–29, 2022.
  • [20] X. Ma, J. Wu, S. Xue, J. Yang, C. Zhou, Q. Z. Sheng, H. Xiong, and L. Akoglu, “A comprehensive survey on graph anomaly detection with deep learning,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [21] G. Zhang, J. Wu, J. Yang, A. Beheshti, S. Xue, C. Zhou, and Q. Z. Sheng, “Fraudre: Fraud detection dual-resistant to graph inconsistency and imbalance,” in 2021 IEEE International Conference on Data Mining (ICDM).   IEEE, 2021, pp. 867–876.
  • [22] C. Liu, L. Sun, X. Ao, J. Feng, Q. He, and H. Yang, “Intention-aware heterogeneous graph attention networks for fraud transactions detection,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3280–3288.
  • [23] C. Liu, L. Wen, Z. Kang, G. Luo, and L. Tian, “Self-supervised consensus representation learning for attributed graph,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2654–2662.
  • [24] D. Lim, F. Hohne, X. Li, S. L. Huang, V. Gupta, O. Bhalerao, and S. N. Lim, “Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [25] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata, “Learning robust representations via multi-view information bottleneck,” arXiv preprint arXiv:2002.07017, 2020.
  • [26] F. Bao, “Disentangled variational information bottleneck for multiview representation learning,” in CAAI International Conference on Artificial Intelligence.   Springer, 2021, pp. 91–102.
  • [27] Z. Wan, C. Zhang, P. Zhu, and Q. Hu, “Multi-view information-bottleneck representation learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 10085–10092.
  • [28] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in International conference on machine learning.   PMLR, 2018.
  • [29] T. M. Cover, J. A. Thomas et al., “Entropy, relative entropy and mutual information,” Elements of information theory, vol. 2, no. 1, pp. 12–13, 1991.
  • [30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of machine learning research, vol. 11, no. 12, 2010.
  • [31] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [32] S. Rayana and L. Akoglu, “Collective opinion spam detection: Bridging review networks and metadata,” in Proceedings of the 21th acm sigkdd international conference on knowledge discovery and data mining, 2015, pp. 985–994.
  • [33] J. J. McAuley and J. Leskovec, “From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews,” in Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 897–908.
  • [34] M. Shi, Y. Tang, X. Zhu, D. Wilson, and J. Liu, “Multi-class imbalanced graph convolutional network learning,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.
  • [37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  • [38] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” arXiv preprint arXiv:1903.02428, 2019.
  • [39] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.