Fake Reviewer Group Detection in Online Review Systems

1^st Chen Cao Dalian University of Technology
Dalian, China
[email protected] 2^nd Shihao Li Dalian University of Technology
Dalian, China
[email protected] 3^rd Shuo Yu^🖂 Dalian University of Technology
Dalian, China
[email protected] 4^th Zhikui Chen Dalian University of Technology
Dalian, China
[email protected]

Abstract

Online review systems are important components in influencing customers’ purchase decisions. To manipulate a product’s reputation, many stores hire large numbers of people to produce fake reviews to mislead customers. Previous methods tackle this problem by detecting malicious individuals, ignoring the fact that the spam activities are often formed in groups, where individuals work collectively to write fake reviews. Fake reviewer group detection, however, is more challenging due to the difficulties in capturing the underlying relationships in groups. In this work, we present an unsupervised and end-to-end approach for fake reviewer group detection in online reviews. Specifically, our method can be summarized into two procedures. First, cohensive groups are detected with modularity-based graph convolutional networks. Then the suspiciousness of each group is measured by several anomaly indicators from both individual and group levels. The fake reviewer groups can be finally detected through suspiciousness. Extensive experiments are conducted on real-world datasets, and the results show that our proposed method is effective in detecting fake reviewer groups compared with the state-of-the-art baselines.

Index Terms:

group anomaly detection, graph learning, fake review

I Introduction

Recent decades have witnessed rapidly increasing popularity in online review systems. More and more people write reviews and refer to reviews on websites such as Yelp, Amazon before purchasing since they give customers useful information and first-hand experiences about goods. Therefore, online reviews weigh heavily in promoting sales. However, this financial importance also lures some stores and organizations to hire people to write quantities of fake reviews. This activity is commonly called review spam. By indirectly promotes or demotes a product’s reputation, review spam effectively affects revenues in business.

This malicious intrusion into online review systems has attracted much attention from researchers. The problem of detecting such users was first formulated by Jindal et al.[1], and then various solutions have been proposed[2][3][4]. Traditional methods leverage handcrafted individual features to classify reviews or reviewers. The features commonly used can be sorted into three categories: behavior-based features[2][5][6] that characterize reviewers’ behavior, language-based[3][7] features that utilize the linguistic patterns of review content, and relation-based[4] features that capture the underlying relationships among users via graphs.

Most previous studies concentrate on detecting individual fake reviewers, but in real world, fake reviewers often participate in spam activities together as a group. And fake reviewer groups can do much more harm to online review systems comparing to individual ones. However, it is non-trivial to detect fake reviewer groups for two reasons. First, a user can appear normal if analyzed at the individual level, but his or her suspiciousness may only be evident when viewing collectively with some other users. Second, target products and participants of spam activities often overlap among different fake reviewer groups, which makes it difficult to accurately discovering such members. Previous approaches in faker reviewer group detection [8] have proposed several group suspiciousness determinants that capture group-level features based on graphs. However, they do not harness group-level and individual-level features together, leaving out many valuable footprints from metadata. Also, the sizes of detected groups in these methods are mostly quite small [9]. Meanwhile, studies have shown that the rising of crowdsourcing platforms accelerates organizing large groups to conduct powerful malicious web activities [10], including review spam [11][12], while small reviewer groups can hardly have significant impacts on review systems by contrast. Therefore, large fake reviewer groups are more common in reality, not to mention that detected groups with small sizes have a higher possibility to be generated purely by accident.

The aforementioned graph-based models all exploit conventional graph approaches. With progress in Graph Neural Networks (GNNs), many researchers apply GNNs to the review spam problem. Unlike previous approaches, GNN-based methods can aggregate messages from neighbors to represent nodes. Most existing works in review spam leverage GNNs for node representation and classification, and this neglects its capability in graph pooling and clustering. Meanwhile, studies have shown that GNNs can utilize higher-order structural information generated by clusters [13].

In this paper, we propose a fake reviewer group detection method named REAL (modulaRity basEd grAph cLustering for fake reviewer group detection). In detail, REAL first employs GNNs and spectral modularity for graph clustering to find candidate groups. Next, it measures each group’s suspiciousness by considering both the group-level indicators and the average values of individual-level indicators collectively via anomaly scores. Based on the ranking of anomaly scores, REAL can extract the most suspicious fake reviewer groups. Specifically, we refer to modularity metrics to discover overlapping clusters, which can better characterize real-world review spam scenarios. Also, by setting the number of clusters properly, we can extract groups with large sizes. Furthermore, we elaborate a group-level anomaly indicator Group Anomaly Compactness that characterizes the closeness in the relation among group members. REAL is comprehensively effective at detecting large fake reviewer groups with high precision.

Refer to caption — Figure 1: An example of a fake reviewer group’s spam activity in two restaurants.

In summary, our contributions are three-fold:

•

We present REAL, an unsupervised and end-to-end fake reviewer group detection model leveraging modularity-based graph clustering. REAL introduces spectral modularity for overlapping clustering to facilitate detecting candidate fake reviewer groups.
•

We unify both group-level and individual-level anomaly indicators together to compute an anomaly score for each candidate group. These indicators harness clues from users’ behaviors, temporal patterns, and latent relational closeness to evaluate the suspiciousness of a group.
•

We conduct extensive experiments on three real-world datasets, namely, YelpNYC, YelpZip, and YelpCHI. The results show that REAL is effective at detecting fake reviewer groups and outperforms three baselines by striking a balance between precision and group size.

The rest of this paper is organized as follows. Section II goes through the related work in review spam detection. In Section III, we first introduce some preliminaries of our model REAL, and then demonstrate its framework. Section IV discribes our experiments and results. The conclusion of our work is in in Section V.

II Related Work

II-A Fake Reviewer Detection

Since Jindal and Liu[1] propose the fake review detection problem, fake reviews in online shopping websites have gained increasing concern. Most previous studies in this area can be categorized into the following types: behavior-based, linguistics-based, and relation-based. For behavior-based methods, Li et al.[2] consider semi-supervised models that utilize reviews’ and users’ views by co-training processes. Xie et al.[6] study the temporal pattern of users’ behavior to uncover fake reviewers. For linguistics-based methods, Feng et al.[3] exploit features generated from Context Free Grammar parse trees to make improvements. Neiterman et al.[7] develop a multilingual speech-based detection model. Considering graphs’ advantages in capturing inter-dependent properties from metadata, some proposed graph-based approaches. Wang et al.[14] present a heterogeneous graph model and use an iterative approach to discover the propagation in users. Li et al.[15] build a user-IP-review graph that connects reviews from the same users and IPs.

Most existing methods focus on fake reviewer detection at the individual level rather than the group level. However, most review spam activities are well-organized and thus fake reviewers form collusive groups, especially along with the rising of crowdsourcing/crowdturfing systems [11]. Fake reviewer groups are more powerful in distorting a product’s reputation. They are also much more difficult to discover as they may appear benign at the individual level. Mukherjee et al.[16] propose a frequent itemset mining method to detects fake reviewer groups. Wang et al.[8] generate loose fake reviewer groups from graphs in a divide and conquer manner. Wang et al.[9] later introduce a new approach that decompose the entire reviewer graph into small fake reviewer groups by a minimum cut algorithm. Dhawan et al.[17] propose DeFrauder that harnesses group indicators from behavior and graph features to discover and rank fake reviewer groups. However, they do not consider group-level and individual-level features in a unified manner, ignoring the effectiveness of traditional features in group detection.

II-B Graph Neural Networks

With the resurgence in deep learning, researchers have begun to apply GNNs to graph-based anomaly detection[18]. Compared with traditional approaches, GNNs can effectively aggregate information from neighbors and extract features in graphs. Basic GNNs [19] complete aggregation via the mean function and degree of neighbors, after which many improved algorithms followed. GraphSAGE [20] merges messages of self-node features from previous layers and introduces the concept of sampling in GNNs to alleviate the cost problem in computing. GATs [21] use a pair-wise function on nodes that updates weights to learn node features.

Besides the application in fake news detection[22], financial assessment[23], many researchers also use GNNs for review spam problem. For example, Wang et al.[24] trains a GCN[19] model to find fraudsters. Li et al.[25] constructs two graphs and uses GCN to learn reviews’ local and global context. Dou et al. [26] strengthens the aggregation process in GNN to defend from camouflages. However, GNNs in fake review detection are mostly used for node representation and individual reviews/reviewers classification. This ignores the problem of fake reviewer groups and fails to recognize GNNs’ effectiveness in graph pooling especially graph clustering[27]. Moreover, Wang et al.[13] demonstrates that jointly performing representation and classification with GNNs would deteriorate training effectiveness if there is users’ behavior pattern and label semantics are not conformed, which is a common case in real-world settings.

Compared with previous models, REAL contributes comprehensively to the issues above. It uses GNNs as a building block and utilizes information from users’ behavior and graph structure collectively to discover the most suspicious groups.

III Preliminaries

III-A Graph Construction

We construct an attributed ”product-rating” graph $\mathcal{G}=\mathcal{(V,E)}$ that aims to incorporate more information from metadata. As demonstrated in Fig. 2, a node $v_{ij}\in\mathbf{V}$ represents a product $p_{i}$ and one kind of its ratings $r_{j}$ , namely, a product-rating pair $(p_{i},r_{j})$ . The node attribute $\mathbf{a}^{n}_{ij}$ is defined as the reviewer set where each member has given product $p_{i}$ the rating $r_{j}$ . The edge $e_{(ij,mn)}$ indicates reviewer sets in which members share identical rating behavior on the two products $p_{i},p_{m}$ , and the edge attribute $\mathbf{a^{e}_{(ij,mn)}}$ denotes the co-review and co-rating reviewer set $\mathcal{R}_{(ij,mn)}$ . For example, if reviewer Alice and reviewer Bob both gave $p_{i}$ the score of $r_{j}$ and $p_{m}$ the score of $r_{n}$ , then Alice and Bob should be included in the $\mathcal{R}_{(ij,mn)}$ . Note that we assume that each user can only give a product one rating.

III-B Spectral Modularity for Clustering

We first extract candidate groups based on cluster results before using group-level anomaly indicators. We handle this task by using methods in graph clustering, which is also commonly known as dividing networks into multiple communities. A community in networks is popularly defined as a group that has tighter relation within, which aligns with our description of fake reviewer groups.

Traditional approaches in graph clustering use cut-based metrics to detect communities in graphs[28]. However, the good cuts these methods assume[29] are vulnerable as one node may belong to more than one community. In our scenario, a product could be involved in several spam activities and a fake reviewer could participate in several collusive groups. Meanwhile, modularity[30] based metrics tackle this problem by maximizing intra-cluster relation and minimizing inter-cluster relation. Specifically, given each node $i$ its degree $d_{i}$ , a random graph with $m$ edges is generated, and the node pair( $u,v)$ is linked with probability $d_{u}d_{v}/2m$ . Modularity is then calculated as follows:

Q=\frac{1}{2m}\sum_{ij}[A_{ij}-\frac{d_{i}d_{j}}{2m}\delta(c_{i},c_{j})]

(1)

\delta(u,v)=\begin{cases}1&u=v\\ 0&\text{otherwise}\end{cases}

(2)

where $c_{x}$ denotes the cluster a node is assigned to. Modularity calculates the divergence of the edges within a cluster from the expected one, and we harness it to characterize and detect overlapping collusive groups in $\mathcal{G}$ .

To solve the NP-hard problem of maximizing the modularity, let $d$ be the degree vector, modularity matrix $\mathbf{C}\in 0,1^{n\times k}$ be the cluster assignment and $\mathbf{B}=\mathbf{A}-\frac{\mathbf{d}\mathbf{d}^{\top}}{2m}$ , and modularity can be refined as

\mathbf{Q}=\frac{1}{2m}\text{Tr}(\mathbf{C}^{\top}\mathbf{B}\mathbf{C})

(3)

where Tr(*) denotes the trace of matrix *.

While this process utilizes completely the structure of graphs, we still face the cost problem when adapting it to our attributed graphs.

IV Methodology

In this part, we dilate on the framework of REAL. It first extracts candidate groups on the constructed graph by graph clustering with deep modularity networks. Then REAL measures the suspiciousness of candidate groups by computing anomaly scores combining group-level indicator Group Anomaly Compactness and effective individual-level indicators. After ranking the anomaly score of all candidate groups, the most suspicious groups are detected. The overall framework is illustrated in Fig.3.

IV-A Graph Convolutional Networks

The aim of generalizing convolutions to graph domain is to encode the nodes with signals in the receptive fields. Given a graph $\mathcal{G}=(\mathcal{V,\mathcal{E}})$ with nodes count $N=|\mathcal{V}|$ , where each node is represented with a feature vector. This way the convolution operator can be represented by Hadamard product in the Fourier domain.

For a GCN with $L$ convolutional layers, each layer embeds nodes by aggregating its neighbors from previous layer. Consider the common practice for a GCN:

\mathbf{H}^{(t+1)}=\sigma(\widetilde{\mathbf{A}}\mathbf{H}^{(t)}\mathbf{W}^{(t)})

(4)

where $\widetilde{\mathbf{A}}=\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}$ , $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ , $\hat{\mathbf{D}}$ is the diagonal node degree matrix of $\hat{\mathbf{A}}$ . $\mathbf{H}^{(0)}=\mathbf{X}$ is the input in the first layer. $\mathbf{X}$ is the feature matrix. $\mathbf{W}^{(t)}$ is a learnable matrix shared among all nodes at layer $t$ , and $\sigma(\cdot)$ is the activation function.

In REAL, we use SeLU[31] for activation, which is proven to benefit convergence. Also, we skip the self-loop and insert a weight matrix $\mathbf{W}_{s}$ for skip connection.

Therefore, the $t$ -th output layer for $\mathbf{H}^{(t+1)}$ is

\mathbf{H}^{(t+1)}=\text{SeLU}(\mathbf{\widetilde{A}}\mathbf{H}^{(t)}\mathbf{W}+\mathbf{H}\mathbf{W}_{s})

(5)

IV-B Deep Modularity Network

To conduct graph clustering on $\mathcal{G}$ and select candidate collusive groups, we refer to deep modularity networks that use modularity-based loss function for optimization. In detail, the model gets the cluster assignment matrix via softmax on graph convolution networks. We obtain $\mathbf{C}$ as follows:

\mathbf{C}=\text{softmax}(GCN(\mathbf{\widetilde{A}},\mathbf{X}))

(6)

We then introduce spectral modularity in its optimization function. However, allocating every node to the same cluster may trigger the local minima problem which harms optimization[32]. Therefore we assign spectral clustering a regularization to guarantee the informativeness in clustering. To avoid being too restrictive when performed with softmax, the collapse regularization is relaxed as $\frac{\sqrt{k}}{n}\left\|\sum_{i}\mathbf{C}_{i}^{\top}\right\|_{F}-1$ , where $\left\|\cdot\right\|_{F}$ is the Frobenius norm, $n$ is the number of nodes and $k$ is the number of the clusters.

So the complete loss function is defined as follows:

\mathcal{L}=-\frac{1}{2m}\text{Tr}(\mathbf{C}^{\top}\mathbf{B}\mathbf{C})+\frac{\sqrt{k}}{n}\left\|\sum_{i}\mathbf{C}_{i}^{\top}\right\|_{F}-1

(7)

IV-C Anomaly Indicators

Our next step is to evaluate candidate groups with anomaly indicators and sort out the most suspicious ones. After graph clustering, nodes in the graph $\mathcal{G}$ are assigned to the clusters. We then intersect the reviewer sets of targeted products in the same clusters, which are also the attributes of nodes, and get the group members of each cluster. To grasp messages from reviewers’ behavior, content and underlying relation, we adopt the indicators discussed below. Note that according to Xu et al.[33], language features can be easily imitated by fake reviewers. Furthermore, many online review systems only allow rating for products without comments in texts. Therefore, in our settings, we focus on rating behaviors and users’ temporal characteristics, discarding the language features.

IV-C1 Group Anomaly Compactness Indicator

As members in fake reviewer groups may appear benign when considered separately but become conspicuous outliers in a group context, we attach greater importance to group-level indicators than individual ones. We employ several group anomaly indicators that measure behavioral suspiciousness in group context[8]. After experimenting on previously illustrated group indicators, and we find only a small fraction of them characterizes fake reviewer groups with high accuracy. To efficiently demonstrate fake reviewer groups, we summarize these group indicators and elaborate a group-level indicator that quantifies the closeness within a group, namely, Group Anomaly Compactness. It effectively fuses the information about a group’s product set, review set and members’ behaviors. Note that the formation of small groups is highly likely to be coincident because it’s common for two or three people to co-review a product by chance, but by design for large groups. Therefore, a penalty function is designed as follows:

L(g)=\frac{1}{1+e^{-(|\mathcal{R}(g)+|\mathcal{P}(g)|-3})}

(8)

where $\mathcal{R}(g)$ is the reviewers in group $g$ , and $\mathcal{P}(g)$ is the respective products rated by the users in $\mathcal{g}$ . $\mathcal{P}_{i}$ is the i-th product set a user has reviewed. We then measure the group-level suspiciousness via the following indicators and then compute Group Anomaly Compactness.

•

Review Tightness (RT) If a large proportion of people jointly review some products, it indicates spam group activities. Given a candidate fake reviewer groups $g$ , review tightness calculates its ratio of the total count of reviews to the product of the product set and the user set multiplies $L(g)$ :

$RT(g)=\frac{\sum_{i\in{\mathcal{R}(g)}}|\mathcal{P}_{i}|}{|\mathcal{R}(g)||\mathcal{P}(g)|}\cdot L(g)$ (9)
•

Product Tightness (PT) When a group focuses on certain products while reviewing few other ones, it is likely to be a fake reviewer group as members in normal groups would rate a variety of goods and the intersection of the product they review wouldn’t be that large. Given a group $g$ , its product tightness is the number of products commonly rated by the group $g$ to the total amount of products rated by all the members:

$PT(g)=\frac{|\cap_{r\in{\mathcal{R}_{g}}}\mathcal{P}_{i}|}{|\cup_{r\in{\mathcal{R}_{g}}}\mathcal{P}_{i}|}$ (10)
•

Neighbor Tightness (NT) When the products two groups reviewed are similar, both groups are possible fake reviewers because it’s rare for two groups of people to share highly similar product sets in normal scenarios. Therefore, we define neighbor tightness as the average of Jaccard Similarity (JS) of the product sets in each user pairs:

$NT(g)=\frac{2\sum_{i,j\in{\mathcal{R}(g)}}JS(\mathcal{P}_{i},\mathcal{P}_{j})}{\mathop{|\mathcal{R}_{g}|}}L(g)$ (11)

Then with the following group-level indicators, we can compute Group Anomaly Compactness $\Pi$ as follows:

$\Pi=RT(g)*PT(g)*NT(g)$ (12)

IV-C2 Individual Anomaly Indicators

Though members in a fake reviewer group may appear normal at the individual level, we still exploit clues from the users’ behavioral and temporal features which are complementary to our work. Note that we would give more weight to Group Anomaly Compactness than individual-level indicators when computing the final anomaly scores.
•

Average User Rating Deviation (AVD) Defining rating deviation $d_{ij}$ as the absolute deviation of $i$ ’s rating from the product $j$ ’s average rating[2], average user rating deviation is the average value of the reviews’ deviation that belongs to the reviewer.

•

Burstness (BST) Fake reviewers often appear during a short term on the website while benign users are often active for a longer term.

BST(i)=\left\{\begin{array}[]{lcl}0&&if\;E(i)-F(i)>\tau,\\ 1-\frac{E(i)-F(i)}{\tau}&&otherwise\\ \end{array}\right.

(13)

where $F(i)$ is the date of user $i$ ’s first review and $E(i)$ is the date of $i$ ’s last review. $\tau$ is the threshold. In our work, we set it to 30 days.

IV-D Candidate Groups Ranking

Given a candidate group $g$ and its members $r_{i}\in\mathcal{R}(g)$ , we measure its suspiciousness by ranking the anomaly score $\Omega$ . To be specific, after calculating its scores of Group Anomaly Compactness $\Pi$ and the individual-level indicators, each score is scaled between 0 and 1 with min-max Normalization. Then $g$ ’s anomaly score $\Omega$ is calculated as follows:

\Omega=3*\Pi+\frac{\sum_{r_{i}\in\mathcal{R}(g)}ARD(r_{i})}{|\mathcal{R}(g)|}+\frac{\sum_{r_{i}\in\mathcal{R}(g)}BST(r_{i})}{|\mathcal{R}(g)|}

(14)

where Group Anomaly Compactness multiples 3 before added to the final anomaly scores as we emphasize more on group-level anomaly messages.

The higher a group’s anomaly score is, the more suspicious it is to be a fake reviewer group.

V Experiments

V-A Datasets

We evaluate our method on three real-world datasets from Yelp.com collected by [34]. YelpNYC contains reviews for restaurants in New York City. YelpCHI, collected by [35], comprises reviews for a set of hotels and restaurants in the Chicago area. YelpZip collects restaurant reviews from plenty of areas ordered by zip code. Details for each datasets is illustrated in Table I. Note that the labels in the datasets are near-ground-truth since they are generated by Yelp’s filtering algorithms.

Dataset	#Reviews	#Reviewers	#Products
YelpNYC	359,052	160,225	923
YelpCHI	67,395	38,063	201
YelpZip	608,598	260,227	5,044

TABLE I: Statistics of three datasets

V-B Baeselines

We compare our methods with three baselines:

•

GraphStrainer by Ye et al. [36] first uses a graph-based method to discover target products and then detects fake reviewer groups via hierarchical clustering based on induced subnets. It does not use any handcrafted features.
•

ColluEagle by Wang et al.[37] is a markov random field based method exploiting group-level behavior features. It uses a density-based clustering approach[38] to extract candidate groups rather than deep approaches. Note that in our experiments we use vanilla ColluEagle without node prior.
•

DeFrauder by Dhawan et al.[17] first uses group indicators to extract groups, and then performs graph embedding via node2vec[39] to calculate density-based spam scores. DeFrauder exploits only group features, ignoring individual-level ones.

V-C Evaluation

With only labels about the truthfulness of each review, it is hard to decide whether several people have worked collaboratively. We consider that the more fake reviewers in a group, the more suspicious the group is. Therefore, we define the precision of results as the ratio of fake reviewers in the group to all group members. In our experiments, we mark the user that has written at least one fake review as fake.

Method	YelpNYC		YelpZip		YelpCHI
Method	Group Size	Precision	Group Size	Precision	Group Size	Precision
DeFrauder	133	0.1955	20	0.4500	21	0.4762
GroupStrainer	23	0.5652	26	0.5769	24	0.4583
ColluEagle	16	0.5625	17	0.8235	16	0.625
REAL	25	0.7600	39	0.5641	24	0.6667

TABLE II: Group size and precision results of the top group in each algorithm. Since ColluEagle fails to discover fake reviewer groups with more than 20 people in all 3 datasets, we choose the largest group in its results for comparison.

We consider detected groups with small sizes are much more likely to form by coincidence and also have only slight effects most of the time, while large fake reviewer groups are more common and harmful as mentioned before. Meanwhile, according to Dunbar’s number, a person’s social circle typically consists 15 good friends and 5 best friends. Considering the sum of these two sets of people as a small social clique, we set 20 as the lower bound for a relatively both large and tight fake reviewer group, and results with less than 20 members will not be taken into account. However, many existing group-level detecting methods[8, 9] can only uncover groups with less than 10 people, especially 2 to 5 people[9], and few detected groups have more than 20 members. Hence we compare the precision of the detected groups that ranked first among the outputs with more than 20 members to alleviate these issues in the baseline for better comparison. GroupStrainer is a cluster-based method, so we select the group that has the highest precision among those have more than 20 members.

V-D Experimental Results

For deep modularity networks in the clustering step, we analyze the effect of the number of clusters on YelpNYC and YelpZip dataset since it can significantly affect the group size in final results, which we consider crucial. We set $k$ in collapse regularization to 0.5, dropout to 0.5 and compare the precision of clustering results, skipping measuring indicators. The precision we calculate at this step is the mean precision value of the top 10 clusters, which is also the optimal situation for the next stage’s detection. Particularly, we present the cluster precision results on YelpNYC and YelpZip.

Method	YelpNYC		YelpZip		YelpCHI
Method	Group Size	Precision	Group Size	Precision	Group Size	Precision
DeFrauder	133	0.1955	20	0.4500	21	0.4762
GroupStrainer	23	0.5652	26	0.5769	24	0.4583
ColluEagle	16	0.5625	17	0.8235	16	0.625
REAL	25	0.7600	39	0.5641	24	0.6667

TABLE III: Group size and precision results of the top group in each algorithm. Since ColluEagle fails to discover fake reviewer groups with more than 20 people in all 3 datasets, we choose the largest group in its results for comparison.

As demonstrated in Fig.4, when the cluster number is relatively small, the cluster precision goes up as the number of clusters gets greater. This could be due to that greater cluster numbers lead to smaller group sizes, which is beneficial for theoretical precision while its practical meanings are doubtful. When the cluster number is relatively large, the cluster precision goes down. The reason is that after clustering target product, we intersect the reviewer sets of the products to get candidate groups. Too many clusters will lead to fewer detected groups, and the more clusters we divide, the smaller the group size is. So a decrease in the total number could negatively affect the precision.

We illustrate the distribution of the size of detected groups in Fig.5, This shows that the sizes of most detected groups are among 20 to 150. After ranking the top 100 fake reviewer groups, the sizes gather around 20 to 100 as shown in Fig.6. This could be due to the difficulty in guaranteeing precision when extracting large sizes of groups.

Finally, we conduct extensive experiments on YelpNYC, YelpZip, YelpCHI with our baselines, and the precision result of the chosen group as presented in Table II. The results show that REAL achieves comprehensively good results on all three datasets.

Though ColluEagle achieves high precision in the results, it does not extract any group of large sizes and most groups it detects involve only 2 to 4 people. We cast doubt on the effectiveness of such results because in practice it is a rare case for a fake reviewer group to involve 2 to 4 people and groups with small sizes may be generated by chance. DeFrauder is able to detect many groups with large sizes, but its precision is relatively low compared with REAL. GroupStrainer performs comprehensively, but is still inferior to REAL. Meanwhile, we also notice that REAL’s precision is significantly lower in YelpZip compared with ColluEagle, we analyze that since YelpZip is especially larger than the other two, it could be that GCN does not use the same local filter to scan every node and the weights in the filters are the same for all neighboring nodes in the receptive field. As a result, REAL’s performance in large graphs is hindered.

Compared with baselines, REAL makes a satisfying balance between the group size and precision.

VI Conclusion

Online review systems are susceptible to review spam, and such activities at the group level can cause detrimental effects because large fake reviewer groups are capable of manipulating the products’ overall reputation. So uncovering such groups is crucial. In real-world settings, overlapping of target products and members in different groups is a common scenario for spam activities. In this work, we demonstrate REAL, a fake reviewer detection approach via modularity-based graph clustering. REAL is an unsupervised model that can be trained end-to-end. It introduces the concept of spectral modularity into GNNs and performs graph clustering to find out candidate groups. REAL then measures the suspiciousness of each group by unifying group-level and individual-level indicators collaboratively. We validate the effectiveness of our method and compare the experimental results with three baselines, and REAL outperforms all of them by discovering the most suspicious and relatively large group. In the future, we plan to improve the precision on larger sizes of detected groups, and also reduce the computational cost in the procedures.

References

[1] N. Jindal and B. Liu, “Opinion spam and analysis,” in Proceedings of the 2008 International Conference on Web Search and Data Mining, ser. WSDM ’08, 2008, pp. 219–230.
[2] F. H. Li, M. Huang, Y. Yang, and X. Zhu, “Learning to identify review spam,” in Twenty-second International Joint Conference on Artificial Intelligence, ser. IJCAI ’11, 2011.
[3] S. Feng, R. Banerjee, and Y. Choi, “Syntactic stylometry for deception detection,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ser. ACL ’12, 2012, p. 171–175.
[4] L. Akoglu, R. Chandy, and C. Faloutsos, “Opinion fraud detection in online reviews by network effects,” in Proceedings of the International AAAI Conference on Web and Social Media, ser. AAAI ’13, vol. 7, no. 1, 2013.
[5] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh, “Spotting opinion spammers using behavioral footprints,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’19, 2013, pp. 632–640.
[6] S. Xie, G. Wang, S. Lin, and P. S. Yu, “Review spam detection via temporal pattern discovery,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’12, 2012, pp. 823–831.
[7] E. Hershkovitch Neiterman, M. Bitan, and A. Azaria, “Multilingual deception detection by autonomous agents,” in Companion Proceedings of the Web Conference 2020, ser. WWW ’20, 2020, p. 480–484.
[8] Z. Wang, T. Hou, D. Song, Z. Li, and T. Kong, “Detecting review spammer groups via bipartite graph projection,” The Computer Journal, vol. 59, no. 6, pp. 861–874, 2016.
[9] Z. Wang, S. Gu, X. Zhao, and X. Xu, “Graph-based review spammer group detection,” Knowledge and Information Systems, vol. 55, no. 3, pp. 571–597, 2018.
[10] M. Motoyama, D. McCoy, K. Levchenko, S. Savage, and G. M. Voelker, “Dirty jobs: The role of freelance labor in web service abuse,” in Proceedings of the 20th USENIX Conference on Security, ser. SEC’11, 2011, p. 14.
[11] G. Wang, C. Wilson, X. Zhao, Y. Zhu, M. Mohanlal, H. Zheng, and B. Y. Zhao, “Serf and turf: Crowdturfing for fun and profit,” in Proceedings of the 21st International Conference on World Wide Web, ser. WWW ’12, 2012, p. 679–688.
[12] A. Fayazi, K. Lee, J. Caverlee, and A. Squicciarini, “Uncovering crowdsourced manipulation of online reviews,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’15, 2015, p. 233–242.
[13] Y. Wang, J. Zhang, S. Guo, H. Yin, C. Li, and H. Chen, “Decoupling representation learning and classification for gnn-based anomaly detection,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’21, 2021, p. 1239–1248.
[14] G. Wang, S. Xie, B. Liu, and S. Y. Philip, “Review graph based online store review spammer detection,” in 2011 IEEE 11th international conference on data mining, ser. ICDM ’11, 2011, pp. 1242–1247.
[15] H. Li, Z. Chen, B. Liu, X. Wei, and J. Shao, “Spotting fake reviews via collective positive-unlabeled learning,” in 2014 IEEE international Conference on Data Mining, ser. ICDM ’14, 2014, pp. 899–904.
[16] A. Mukherjee, B. Liu, and N. Glance, “Spotting fake reviewer groups in consumer reviews,” in Proceedings of the 21st International Conference on World Wide Web, ser. WWW ’12, 2012, p. 191–200.
[17] S. Dhawan, S. C. R. Gangireddy, S. Kumar, and T. Chakraborty, “Spotting collective behaviour of online frauds in customer reviews,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ser. IJCAI ’19, 2019, pp. 245–251.
[18] S. Yu, F. Xia, Y. Sun, T. Tang, X. Yan, and I. Lee, “Detecting outlier patterns with query-based artificially generated searching conditions,” IEEE Transactions on Computational Social Systems, vol. 8, no. 1, pp. 134–147, 2021.
[19] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[20] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS ’17, 2017, pp. 1025–1035.
[21] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” stat, vol. 1050, p. 4, 2018.
[22] Y. Ren, B. Wang, J. Zhang, and Y. Chang, “Adversarial active learning based heterogeneous graph neural network for fake news detection,” in 2020 IEEE International Conference on Data Mining (ICDM), ser. ICDM ’20, 2020, pp. 452–461.
[23] Y. Liu, X. Ao, Q. Zhong, J. Feng, J. Tang, and Q. He, “Alike and unlike: Resolving class imbalance problem in financial credit risk assessment,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, ser. CIKM ’20, 2020, p. 2125–2128.
[24] J. Wang, R. Wen, C. Wu, Y. Huang, and J. Xion, “Fdgars: Fraudster detection via graph convolutional networks in online app review system,” in Companion Proceedings of The 2019 World Wide Web Conference, ser. WWW ’19, 2019, p. 310–316.
[25] A. Li, Z. Qin, R. Liu, Y. Yang, and D. Li, “Spam review detection with graph convolutional networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, ser. CIKM ’19, 2019, pp. 2703–2711.
[26] Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, and P. S. Yu, “Enhancing graph neural network-based fraud detectors against camouflaged fraudsters,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, ser. CIKM ’20, 2020, pp. 315–324.
[27] S. Yu, F. Xia, J. Xu, Z. Chen, and I. Lee, “Offer: A motif dimensional framework for network representation learning,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, ser. CIKM ’20, 2020, p. 3349–3352.
[28] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, pp. 888–905, 2000.
[29] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Statistical properties of community structure in large social and information networks,” in Proceedings of the 17th international conference on World Wide Web, ser. WWW ’08, 2008, pp. 695–704.
[30] M. Newman, “Modularity and community structure in networks,” networks, vol. 7, no. 15, p. 16, 2006.
[31] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS ’17, 2017, pp. 972–981.
[32] F. M. Bianchi, D. Grattarola, and C. Alippi, “Spectral clustering with graph neural networks for graph pooling,” in International Conference on Machine Learning, ser. ICML ’20, 2020, pp. 874–883.
[33] C. Xu, J. Zhang, K. Chang, and C. Long, “Uncovering collusive spammers in chinese review websites,” in Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, ser. CIKM ’13, 2013, p. 979–988.
[34] S. Rayana and L. Akoglu, “Collective opinion spam detection: Bridging review networks and metadata,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 985–994.
[35] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, “What yelp fake review filter might be doing?” in Seventh international AAAI conference on weblogs and social media, ser. AAAI ’13, 2013.
[36] J. Ye and L. Akoglu, “Discovering opinion spammer groups by network footprints,” in Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECMLPKDD’15. Springer, 2015, p. 267–282.
[37] Z. Wang, R. Hu, Q. Chen, P. Gao, and X. Xu, “Collueagle: Collusive review spammer detection using markov random fields,” Data Mining and Knowledge Discovery, vol. 34, no. 6, 2020.
[38] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “Scan: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’07, 2007, p. 824–833.
[39] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16, 2016, p. 855–864.