
Abstract.

Graphs are ubiquitous in many applications, and graph neural networks (GNNs) are the state-of-the-art method for predictive tasks on graphs. However, prediction accuracy is not the only objective: fairness often has to be traded off against accuracy. Although fair GNNs have been studied extensively, the generalizability of the accuracy-fairness trade-off from training to test graphs is less studied. In this paper, we focus on node classification on bipartite graphs, which has wide applications in spam detection and recommendation systems. We propose to balance the two objectives (accuracy and fairness) on the unseen test graph using a novel focal fairness loss and data augmentation. We show that each technique alone, while improving one objective, either sacrifices another objective or fails to generalize the fairness-accuracy trade-off to the population distribution. The proposed integration of the focal loss and data augmentation leads to generalizable, well-balanced trade-offs. We verify that the method is agnostic to the underlying GNN architecture and works both with vanilla GNN and with edge-weighted GNNGuard.


1. Introduction

Fake reviews severely undermine the trustworthiness of the online commerce websites that host them (luca2016reviews). Many detection methods have been studied to identify such reviews accurately, and among them, graph-based approaches (rayana2015collective; li2019spam) have shown great promise. However, most prior methods (dou2020enhancing; wu2020graph; wang2019semi; dou2020robust) focused exclusively on the accuracy or robustness of the fraud detectors, ignoring detection fairness. Several existing works studied fairness issues of graph-based classifiers with respect to sensitive attributes such as gender, age, and race (kang2020inform; ijcai2019p456; li2021dyadic; ma2021subgroup; dai2021say; agarwal2021towards). However, because spammers are anonymous, we have no access to user profile attributes and thus cannot study fairness with respect to such traditional sensitive attributes. In fact, graph-based spam detectors suffer from another fairness problem: reviewers receive unfavorable detection outcomes depending on the number of their historical posts (equal to the degree of the reviewer nodes). Established reviewers, including spammers among them, receive lax regulation since their few spams are discreetly hidden behind loads of normal content. In contrast, new reviewers who have very few posts face a higher risk of false detection and strict regulation. Such discrimination harms user trust and leads to less engagement in online commerce activities. Formally, this type of fairness issue is caused by variant graph topologies (burkholder2021certification). Growing efforts have been dedicated to enhancing fairness with respect to topological bias (burkholder2021certification; li2021dyadic; spinelli2021fairdrop; dai2021say; agarwal2021towards), yet few works on graph-based spam detection promote fairness against topological properties.

Fig. 1 gives an example of our review graph (luca2016reviews; rayana2015collective), which contains user, review, and product nodes. Edges represent the events that a user posts reviews on products. We define users who post fewer reviews than a certain threshold as the “protected” users, denoted by the sensitive attribute A=1. The other users are the “favored” ones, denoted by A=0. Comparing the computational graphs of spams R_{1} and R_{2}, many non-spams dilute the information about the suspiciousness of R_{2} after messages are passed bottom-up by GNN’s aggregation operation in Eq. (1). In this case, existing users with many reviews can reduce suspiciousness and evade detection, which is unfair to new users.

Refer to caption
Figure 1. Problem setting and challenges. Left: a toy example of the review graph \mathcal{G} and group membership. GNN infers the suspiciousness of reviews (e.g., R_{2}) by passing and aggregating the messages from its neighbors. Right: computational graphs of GNN on spam reviews from different groups and subgroups. GNN unfairly assigns a low suspiciousness to spam R_{2} posted by mixed user U_{1}, due to the aggregation of messages coming from other non-spams posted by U_{1}.

In fact, maintaining fairness between groups of users categorized only by node degree is imprecise. Detection fairness essentially depends on whether two users have the same capability to hide their spam reviews behind their normal reviews. Two “favored” users can receive different treatment from the detector due to their heterogeneous behaviors, i.e., the varying proportions of spams to non-spams across users. “Protected” users, in contrast, post only a few reviews that belong to the same class, reflecting homogeneous user behaviors, so their reviews are treated equally by the detector. To ensure fairness in spam detection, the heterogeneous behavior of “favored” users must be accurately described by an additional sensitive attribute A^{\prime} indicating whether their reviews are from the same class. In Fig. 1, user U_{1}, called a “mixed” user and denoted by A^{\prime}=1, posts both spams and non-spams, where the spams deceive the detector by aggregating messages from the non-spams. U_{2}, called a “pure” user and denoted by A^{\prime}=0, posts only spams and thus receives no messages from non-spams to lower the suspiciousness calculated by the GNN model. To obtain higher accuracy on the favored group, GNN will unfairly target spam reviews posted by pure users like U_{2}, since they are easier to detect. If the spams from “mixed” users become easy to catch as well, GNN will detect spams from “mixed” users better without harming the detection accuracy on spams from “pure” users. As a result, distinguishing mixed users from pure users is crucial. Solving this problem involves three main challenges:

Define the subgroup structure. Most previous work studied the fairness problem with well-defined sensitive attributes. Studies including (ustun2019fairness; kearns2018preventing) utilize observable sensitive attributes and their combinations to divide the dataset into various groups. Others (celis2021fair; awasthi2020equalized; mehrotra2021mitigating) consider the fairness problem with unobserved or noisy sensitive attributes. In (chen2019fairness), researchers built probabilistic models to approximate the value of a well-defined sensitive attribute using proxy information, e.g., inferring race from last name and geolocation. All of these methods operate on i.i.d. vector data with well-defined sensitive attributes. Unlike the above works, we enhance fairness by discovering a novel structure-based sensitive attribute A^{\prime} and its approximation. Additionally, sensitive attributes in previous works (dwork2012fairness; hardt2016equality; zemel2013learning) have been treated as data characteristics that are not determined by ground truth labels. We are looking for a previously undefined sensitive attribute A^{\prime} that is both specific to graph-based spam detection and related to the ground truth of reviews.
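To make the label dependence concrete, on labeled training data the ground-truth value of such an attribute can be read off directly from review labels. The sketch below is ours (function and variable names hypothetical), and it assumes every listed user has at least one labeled review:

```python
import numpy as np

def subgroup_membership(review_user, review_label, favored_users):
    """Ground-truth A' for each favored user: A' = 1 ("mixed") if the user's
    labeled reviews contain both spams and non-spams, A' = 0 ("pure") if all
    of the user's reviews share one class."""
    a_prime = {}
    for u in favored_users:
        labels = review_label[review_user == u]
        a_prime[u] = int(labels.min() != labels.max())  # both classes present?
    return a_prime

# Toy usage: user 7 posts a spam and a non-spam (mixed); user 9 posts only spams (pure).
review_user = np.array([7, 7, 9, 9])
review_label = np.array([1, 0, 1, 1])
print(subgroup_membership(review_user, review_label, favored_users=[7, 9]))  # {7: 1, 9: 0}
```

On test data these labels are unobserved, which is exactly why A^{\prime} must be inferred, as discussed next.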

Refer to caption
Figure 2. Adding A^{\prime} improves the accuracy of the detector on spams from the “favored” group (A=0).

Infer unknown subgroup membership. We hypothesize that inferring A^{\prime} helps resolve the unfairness among groups. Indicating a user’s heterogeneous behavior by the subgroup indicator A^{\prime}, either mixed or pure, helps clarify which users from the “favored” group genuinely benefit from their non-spams. In this case, GNN can use the inferred indicator A^{\prime} to make a more equitable detection of spams posted by users from the “favored” group and thus promote group fairness (see Fig. 2). Nevertheless, the indicator A^{\prime} is related to the ground truth labels, which are unobservable for the test data and need to be inferred. Even for the training data, labeling sufficient data and precisely identifying spams are great challenges that are expensive and time-consuming (qiu2020adaptive; tan2013unik; tian2020non).
Promote group fairness through subgroup information. Considering fairness with respect to multiple sensitive attributes, related works (kearns2018preventing; ustun2019fairness) formulated optimization problems with a fairness constraint for each combination of groups. These sensitive attributes must be precise and discrete in order to categorize data and ensure that those being discriminated against are treated fairly. However, these premises are inapplicable to our inferred subgroup membership indicator A^{\prime}, which is probabilistic rather than deterministic. Meanwhile, constraints derived from thresholding noisy sensitive attributes can deteriorate the optimization. Since fairness optimization is highly sensitive to group separation, we avoid setting an uncertain threshold for converting membership from a probability to a concrete group.

The main contributions of this work are as follows:


  • In addition to well-known sensitive attributes, i.e., user node degree (burkholder2021certification) or node attributes (bose2019compositional; dai2021say; wu2021learning), we are the first to define a new structure-based and label-related sensitive attribute A^{\prime} for the fair spam detection task on graphs.

  • We propose another GNN g_{\boldsymbol{\theta}} to infer the probability distribution of A^{\prime} for the test users who have unlabelled reviews (i.e., unobserved user behaviors). To address the insufficient training examples for the “favored” group and the “mixed” subgroup, we propose two fairness-aware data augmentation methods to synthesize nodes and edges. GNN generates less biased representations of the minority groups and subgroups when trained on the re-balanced datasets.

  • Rather than converting the estimated probability distribution of A^{\prime} into concrete groups by an uncertain threshold, a joint training method is designed to let the detector f_{\boldsymbol{W}} directly use the inferred A^{\prime}. Our proposed method is evaluated on three real-world Yelp datasets. The results demonstrate the unfairness inside the “favored” group and show that group fairness can be enhanced by introducing the subgroup membership indicator A^{\prime}. We show that regardless of the thresholds used to split the groups regarding A, our joint method promotes group fairness by accurately inferring the distribution of A^{\prime} during model training.

2. Preliminaries

2.1. Spam detection based on GNN

Our spam detection operates on the review graph defined as \mathcal{G}=(\mathcal{V},\mathcal{E}), where \mathcal{V}=\{v_{1},\dots,v_{N}\} denotes the set of nodes and \mathcal{E}\subseteq\mathcal{V}\times\mathcal{V} denotes the set of undirected edges. Each node v_{i}\in\mathcal{V} has a feature vector \mathbf{x}_{i} indexed by i. \mathcal{G} contains three types of nodes: user, review, and product, where each node belongs to exactly one of the three types (see the example in Fig. 1, left). Let \mathcal{V}^{U}, \mathcal{V}^{R}, and \mathcal{V}^{P} denote the sets of user, review, and product nodes, respectively. v_{i} has a set of neighboring nodes denoted by \mathcal{N}(i)=\{v_{j}\in\mathcal{V}\mid e_{i,j}\in\mathcal{E}\}. This work focuses on detecting spam users and reviews, and the task is essentially a node classification problem. GNN (kipf2016semi) is the state-of-the-art method for node classification (wu2020comprehensive). For our GNN detector f_{\boldsymbol{W}}(\cdot), let \mathbf{h}_{i}^{(l)} be the learned representation of node v_{i} at layer l, where l=1,\dots,L:

(1) \mathbf{h}_{i}^{(l)}=\text{UPDATE}\left(\text{AGGREGATE}\left(\left\{\mathbf{h}_{i}^{(l-1)}\right\}\cup\left\{\mathbf{h}_{j}^{(l-1)}\mid j\in\mathcal{N}(i)\right\}\right),\boldsymbol{W}^{(l)}\right),

where AGGREGATE takes the mean over \mathbf{h}_{i}^{(l-1)} and the messages passed from the neighboring nodes. UPDATE contains an affine mapping with parameters \boldsymbol{W}^{(l)} followed by a nonlinear mapping (ReLU for l=1,\dots,L-1 and Sigmoid for l=L). The input vector \mathbf{x}_{i}=\mathbf{h}_{i}^{(0)} is the representation at layer 0. \hat{y}_{i}=\mathbf{h}_{i}^{(L)}\in\mathbb{R} denotes the predicted probability of v_{i} being spam.
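For concreteness, the following minimal NumPy sketch implements one such layer with mean aggregation over the node itself and its neighbors (Eq. (1)). The helper names and the toy graph are ours, not the paper's implementation:

```python
import numpy as np

def gnn_layer(H, neighbors, W, activation):
    """One mean-aggregation GNN layer: average each node's representation
    with its neighbors' (AGGREGATE), then apply an affine map with W^{(l)}
    and a nonlinearity (UPDATE), as in Eq. (1)."""
    agg = np.empty_like(H)
    for i in range(H.shape[0]):
        msgs = np.vstack([H[i]] + [H[j] for j in neighbors[i]])
        agg[i] = msgs.mean(axis=0)        # mean over self + neighbor messages
    return activation(agg @ W)            # affine map + nonlinearity

relu = lambda Z: np.maximum(Z, 0.0)
sigmoid = lambda Z: 1.0 / (1.0 + np.exp(-Z))

# Toy 4-node path graph; h^{(0)} = x, one ReLU layer, then a Sigmoid output layer.
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 8))
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
H1 = gnn_layer(H0, neighbors, rng.normal(size=(8, 8)), relu)
y_hat = gnn_layer(H1, neighbors, rng.normal(size=(8, 1)), sigmoid)  # per-node spam probability
```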

We minimize the cross-entropy loss over the training node set \mathcal{V}^{\text{Tr}}:

(2) \mathcal{L}(\boldsymbol{W};\mathcal{G})=\frac{1}{|\mathcal{V}^{\text{Tr}}|}\sum_{v_{i}\in\mathcal{V}^{\text{Tr}}}\left(-y_{i}\log\hat{y}_{i}-(1-y_{i})\log(1-\hat{y}_{i})\right),

where y_{i}\in\{0,1\} is the class of v_{i}, and \boldsymbol{W} denotes the collection of parameters across all L layers. The main notations are listed in Table 1.
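Eq. (2) translates directly into code; below is a small sketch assuming predictions, labels, and a boolean mask for \mathcal{V}^{\text{Tr}} stored as NumPy arrays (names ours):

```python
import numpy as np

def detection_loss(y_hat, y, train_mask):
    """Average binary cross-entropy over the training review nodes (Eq. (2))."""
    p = np.clip(y_hat[train_mask], 1e-7, 1 - 1e-7)  # guard against log(0)
    t = y[train_mask]
    return float(np.mean(-t * np.log(p) - (1 - t) * np.log(1 - p)))
```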

2.2. Fairness regularizer

Group. We split the users \mathcal{V}^{U} into the protected group \mathcal{V}^{U}_{1}, whose degree is smaller than the p-th percentile of all users’ degrees, and the favored group \mathcal{V}^{U}_{0} containing the remaining users (see Fig. 1, left). The subscript denotes the value of A. The review nodes \mathcal{V}^{R} are divided into \mathcal{V}^{R}_{1} and \mathcal{V}^{R}_{0} following the groups of their associated users. The fairness of the GNN detector is evaluated by comparing the detection accuracies between the two groups of spams, \mathcal{V}^{R}_{0} and \mathcal{V}^{R}_{1}.
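The group assignment reduces to a percentile threshold on user degree; a minimal sketch (function name ours):

```python
import numpy as np

def assign_groups(user_degrees, p=20):
    """A = 1 (protected) for users whose degree is below the p-th percentile
    of all user degrees; A = 0 (favored) for the remaining users."""
    cutoff = np.percentile(user_degrees, p)
    return (user_degrees < cutoff).astype(int)

degrees = np.array([1, 2, 3, 10, 25, 40])
print(assign_groups(degrees, p=20))  # [1 0 0 0 0 0]: only the lowest-degree user is protected
```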

Fairness regularizer. Ranking-based metrics like NDCG are appropriate for evaluating the detector’s accuracy due to the highly skewed class distribution: most reviews are genuine. A larger NDCG score indicates that spams are ranked higher than non-spams and the detector is more precise. NDCG can also be evaluated separately on the two groups \mathcal{V}^{R}_{0} and \mathcal{V}^{R}_{1}. Hence, group fairness can be evaluated by the NDCG gap between \mathcal{V}^{R}_{0} and \mathcal{V}^{R}_{1}, denoted as \Delta_{\text{NDCG}}. In fact, the NDCG on the “favored” group \mathcal{V}^{R}_{0} is always lower than that on \mathcal{V}^{R}_{1}, since the detector performs more precisely on group \mathcal{V}^{R}_{1}. To reduce \Delta_{\text{NDCG}} by promoting the NDCG of \mathcal{V}^{R}_{0} without hurting that of \mathcal{V}^{R}_{1}, our detector f_{\boldsymbol{W}}(\cdot) starts with a fairness regularizer \mathcal{R}_{\text{fair}}, which takes the negative NDCG of \mathcal{V}^{R}_{0}. We adopt a differentiable surrogate NDCG (burkholder2021certification) for \mathcal{R}_{\text{fair}}:

(3) \mathcal{R}_{\text{fair}}(\boldsymbol{W};\mathcal{G})=-\frac{1}{Z_{0}}\sum_{\substack{i,j:\,y_{i}<y_{j}\\ v_{i},v_{j}\in\mathcal{V}^{R}_{0}\cap\mathcal{V}^{\text{Tr}}}}\log\left(1+\exp\left(\mathbf{h}_{j}^{(L)}-\mathbf{h}_{i}^{(L)}\right)\right),
(4) \mathcal{L}_{\text{GNN}}(\boldsymbol{W};\mathcal{G})=\mathcal{L}(\boldsymbol{W};\mathcal{G})+\lambda\cdot\mathcal{R}_{\text{fair}}(\boldsymbol{W};\mathcal{G}),

where Z_{0} is the total number of (spam, non-spam) pairs in the training portion of \mathcal{V}^{R}_{0}. \mathcal{L}_{\text{GNN}} is the objective function for training f_{\boldsymbol{W}}(\cdot), obtained by adding \mathcal{R}_{\text{fair}} to \mathcal{L}(\boldsymbol{W};\mathcal{G}) in Eq. (2). \lambda>0 controls the importance of the fairness regularizer. Note that \mathcal{R}_{\text{fair}} regularizes all the models referred to in the remainder of this paper.
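Eq. (3) is a pairwise logistic term summed over (non-spam, spam) pairs within the favored training reviews; a vectorized sketch (array conventions and names ours):

```python
import numpy as np

def fairness_regularizer(scores, y, favored_train_mask):
    """Surrogate-NDCG fairness regularizer R_fair on the favored group (Eq. (3))."""
    s = scores[favored_train_mask]
    t = y[favored_train_mask]
    spam, nonspam = s[t == 1], s[t == 0]           # h_j with y_j = 1, h_i with y_i = 0
    pair_terms = np.log1p(np.exp(spam[:, None] - nonspam[None, :]))
    Z0 = spam.size * nonspam.size                  # number of (spam, non-spam) pairs
    return -pair_terms.sum() / Z0

def total_loss(ce_loss, r_fair, lam=5.0):
    """L_GNN = L + lambda * R_fair (Eq. (4)); lambda = 5 as set in Section 3.1."""
    return ce_loss + lam * r_fair
```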

Table 1. Notations and definitions.
Graph notations:
\mathcal{G}: Review graph
\mathcal{V}, \mathcal{E}: Nodes and edges of graph \mathcal{G}
\mathbf{x}_{i}, y_{i}: Feature and label of node v_{i}
\mathcal{N}(i): Set of direct neighbors of v_{i}
|\mathcal{V}|: Cardinality of a set \mathcal{V}
\mathcal{V}^{\text{Tr}}, \mathcal{V}^{\text{Te}}: Training nodes and test nodes
\mathcal{V}^{U}, \mathcal{V}^{R}, \mathcal{V}^{P}: User, review, and product nodes
Group notations:
A, A^{\prime}: Binary sensitive attributes (0/1)
\mathcal{V}^{R}_{a}, \mathcal{V}^{U}_{a}: Review and user nodes from the group with A=a
\mathcal{V}^{R}_{a,a^{\prime}}, \mathcal{V}^{U}_{a,a^{\prime}}: Review and user nodes from the group with A=a and A^{\prime}=a^{\prime}
Model notations:
f_{\boldsymbol{W}}(\cdot), g_{\boldsymbol{\theta}}(\cdot): GNNs with parameters \boldsymbol{W} and \boldsymbol{\theta}
\hat{y}_{i}, \hat{A}^{\prime}_{i}: Outputs of f_{\boldsymbol{W}}(\cdot) and g_{\boldsymbol{\theta}}(\cdot) for v_{i}
\mathbf{h}_{i}^{(l)}: Representation of v_{i} at layer l
\tilde{\mathbf{x}}_{ij}: Synthetic node obtained by mixing up v_{i} and v_{j}
\tilde{y}_{ij}: Label for the synthetic node \tilde{\mathbf{x}}_{ij}

3. Experiments

We seek to answer the following research questions:
RQ1: Do fairness issues exist between the “favored” and the “protected” groups, and between the “mixed” and the “pure” subgroups, when using a GNN spam detector? Does the inferred A^{\prime} improve group fairness?
RQ2: How can the subgroup memberships A^{\prime} defined in Eq. (LABEL:eq._definition_for_A') be inferred with a shortage of training examples?
RQ3: Can the joint training method simultaneously promote the AUC of the predicted A^{\prime} and group fairness?
RQ4: Does the accuracy of predicting A^{\prime} contribute to the improvement in group fairness?

3.1. Experimental Settings

Datasets. We used three Yelp review datasets (“Chi”, “NYC”, and “Zip” for short; see Table 2), which are commonly used in previous spam detection tasks (burkholder2021certification; dou2020enhancing). For the cutoff degree of user nodes (the p-th percentile in Section 2.2), we treat p\in\{20,15,10\} as a hyper-parameter distinguishing the favored users (high-degree users, A=0) from the protected users (the remaining users, A=1). Reviews take the same value of A as their associated users. The favored users are further split into pure (A^{\prime}=0) and mixed (A^{\prime}=1) users following Eq. (LABEL:eq._definition_for_A'). Users are divided into training (30%), validation (10%), and test (60%) sets together with their associated reviews.

Table 2. Statistics of the datasets. We list the numbers of products, reviews, and users, with the proportions of favored users/reviews (%\mathcal{V}_{0}^{U} / %\mathcal{V}_{0}^{R}) and mixed users/reviews (%\mathcal{V}_{0,1}^{U} / %\mathcal{V}_{0,1}^{R}) under the 20-th, 15-th, and 10-th percentile (PC) cutoff degrees for the user groups. The last column gives the ratio of spam in the group A=0 to that in A=1, i.e., P(Y=1|A=0)/P(Y=1|A=1).

Chi (|\mathcal{V}^{P}| = 201, |\mathcal{V}^{R}| = 67,395, |\mathcal{V}^{U}| = 38,063)
PC | %\mathcal{V}^{R}_{0} | %\mathcal{V}^{R}_{0,1} | %\mathcal{V}^{U}_{0} | %\mathcal{V}^{U}_{0,1} | P(Y=1|A=0)/P(Y=1|A=1)
20th | 13.581% | 0.116% | 1.781% | 0.011% | 0.0438
15th | 18.687% | 0.129% | 3.003% | 0.026% | 0.0125
10th | 27.003% | 0.224% | 5.735% | 0.058% | 0.0282

NYC (|\mathcal{V}^{P}| = 923, |\mathcal{V}^{R}| = 358,911, |\mathcal{V}^{U}| = 160,220)
PC | %\mathcal{V}^{R}_{0} | %\mathcal{V}^{R}_{0,1} | %\mathcal{V}^{U}_{0} | %\mathcal{V}^{U}_{0,1} | P(Y=1|A=0)/P(Y=1|A=1)
20th | 12.665% | 0.193% | 0.760% | 0.009% | 0.0479
15th | 16.264% | 0.258% | 1.171% | 0.018% | 0.0289
10th | 23.096% | 0.360% | 2.260% | 0.034% | 0.0420

Zip (|\mathcal{V}^{P}| = 5,044, |\mathcal{V}^{R}| = 608,598, |\mathcal{V}^{U}| = 260,277)
PC | %\mathcal{V}^{R}_{0} | %\mathcal{V}^{R}_{0,1} | %\mathcal{V}^{U}_{0} | %\mathcal{V}^{U}_{0,1} | P(Y=1|A=0)/P(Y=1|A=1)
20th | 6.859% | 0.050% | 0.272% | 0.002% | 0.0426
15th | 15.658% | 0.145% | 1.020% | 0.009% | 0.0139
10th | 22.342% | 0.278% | 1.968% | 0.018% | 0.0241

Evaluation Metrics. (1) NDCG evaluates the detector’s accuracy, where a higher NDCG means that the detector tends to assign higher suspiciousness to spams than to non-spams. (2) To evaluate the ranking of spams within the group A=0, we propose a metric called the “Average False Ranking Ratio” (AFRR) to measure the average relative ranking between spams from mixed and pure users:

(5) \text{AFRR}_{A^{\prime}}=\frac{1}{Z}\sum_{\substack{v_{j}\in\mathcal{V}^{R}_{0,A^{\prime}}\\ y_{j}=1}}\frac{\sum_{i=1}^{|\mathcal{V}^{R}_{0}|}\mathbbm{1}\left[\hat{y}_{i}>\hat{y}_{j},\,y_{i}=0\right]}{\sum_{i=1}^{|\mathcal{V}^{R}_{0}|}\mathbbm{1}\left[y_{i}=0\right]},\quad A^{\prime}\in\{0,1\},

where A^{\prime}\in\{0,1\} denotes the subgroup membership, Z is the number of spams from the subgroup, and \mathcal{V}^{R}_{0,A^{\prime}} denotes the reviews from the users of that subgroup. The ratio in the above equation calculates the proportion of non-spams ranked higher than a spam among all the non-spams from the group A=0. The lower the AFRR, the fewer non-spams are ranked higher than spams. Compared to NDCG, AFRR considers the non-spams across different subgroups and ignores the relative ranking of spams from the other subgroup. (3) AUC evaluates the performance of g_{\boldsymbol{\theta}} in predicting A^{\prime}: the larger the AUC value, the more accurate the A^{\prime} inferred by g_{\boldsymbol{\theta}}.
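A direct reading of Eq. (5) as code, assuming the arrays are already restricted to the favored group's reviews (names ours):

```python
import numpy as np

def afrr(y_hat, y, a_prime, subgroup):
    """Average False Ranking Ratio (Eq. (5)) for the subgroup A' = subgroup.
    All input arrays cover reviews of the favored group (A = 0) only."""
    nonspam_scores = y_hat[y == 0]
    n_nonspam = nonspam_scores.size                        # denominator of the inner ratio
    spam_scores = y_hat[(y == 1) & (a_prime == subgroup)]  # spams in V_{0,A'}^R
    Z = spam_scores.size                                   # number of spams in the subgroup
    ratios = [(nonspam_scores > s).sum() / n_nonspam for s in spam_scores]
    return float(np.sum(ratios) / Z)
```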

Methods compared. Joint+GNN-S_{1}^{\text{Tr}} denotes our method: Joint is the joint training of the two GNNs (f_{\boldsymbol{W}} and g_{\boldsymbol{\theta}} in Section LABEL:sec_3.3:_joint_training), and GNN-S_{1}^{\text{Tr}} is a GNN with our mixup method in Eq. (LABEL:eq:_mixup_second_node). “a+b” denotes a setting where “a” is a method for obtaining the value of A^{\prime} and “b” is a spam detector.

Baselines for obtaining the value of A^{\prime} (selections for “a”):


  • w/o: has no definition of A^{\prime}.

  • Random: randomly assigns 1/0 to A^{\prime}.

  • GT: assigns the ground truth of A^{\prime} to users, which is the ideal case since A^{\prime} is unknown in practice.

  • Pre-trained: a variant of Joint that pre-trains g_{\boldsymbol{\theta}} to infer A^{\prime} and fixes the inferred A^{\prime} when training f_{\boldsymbol{W}}.

Baselines for the spam detectors (selections for “b”):


  • FairGNN (dai2021say): an adversarial method to obtain fair predictions for all groups defined by the known A.

  • EDITS (dong2022edits): modifies the node attributes and the graph structure to debias the graph. FairGNN and EDITS treat A as a known sensitive attribute and exclude any information about the A^{\prime} defined in our work.

  • GNN: the vanilla GNN.

  • GNN-S_{0}^{\text{Te}}: a GNN with the mixup Case 2 in Eq. (LABEL:eq:_mixup_second_node).

  • GNN-S_{1}^{\text{Te}}: a GNN with the mixup Case 3 in Eq. (LABEL:eq:_mixup_second_node).

We set T=300, \lambda=5, \beta_{1}=\beta_{2}=0.001, and weight decay = 0.0001 for both f_{\boldsymbol{W}} and g_{\boldsymbol{\theta}} in Algorithm LABEL:alg:_joint_train, with mixup weight \alpha=0.8. We use 10 training-validation-test splits of the three datasets; the following results aggregate the performance over all splits.
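The exact mixup cases of Eq. (LABEL:eq:_mixup_second_node) are defined outside this section; as a reference point, the generic operation behind the notation \tilde{\mathbf{x}}_{ij}, \tilde{y}_{ij} of Table 1 is sketched below. Sampling the coefficient from Beta(\alpha,\alpha) is the standard mixup convention and our assumption here; the S_{1}^{\text{Tr}} / S_{0}^{\text{Te}} / S_{1}^{\text{Te}} cases differ in how the second node v_{j} is chosen:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_node(x_i, y_i, x_j, y_j, alpha=0.8):
    """Synthesize a node by convexly combining two nodes' features and labels.
    alpha = 0.8 is the mixup weight from this section; the Beta(alpha, alpha)
    sampling is the standard convention, assumed rather than quoted."""
    lam = rng.beta(alpha, alpha)
    x_tilde = lam * x_i + (1.0 - lam) * x_j   # synthetic feature x~_{ij}
    y_tilde = lam * y_i + (1.0 - lam) * y_j   # synthetic soft label y~_{ij}
    return x_tilde, y_tilde
```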

Table 3. NDCG values for the outputs of spam detectors on three Yelp datasets with the 20-th percentile of user degree as the cutoff for the groups of A. Shown are the mean and standard deviation over all 10 splits. “*” denotes our method. The lower the \Delta_{\text{NDCG}}, the fairer the model. For the four GNN-based detectors, each dataset column lists results with A^{\prime} obtained by w/o / Pre-trained / Joint*.

Detector | Metric (%) | Chi | NYC | Zip
FairGNN (dai2021say)
  NDCG(\mathcal{V}^{R}): 86.2±0.1 | 85.9±0.0 | 89.6±0.5
  NDCG(\mathcal{V}^{R}_{1}): 86.2±0.1 | 86.0±0.0 | 89.6±0.4
  \Delta_{\text{NDCG}}: 53.7±2.1 | 20.7±1.1 | 36.4±4.6
EDITS (dong2022edits)
  NDCG(\mathcal{V}^{R}): 84.3±0.3 | 84.9±0.1 | 89.2±0.0
  NDCG(\mathcal{V}^{R}_{1}): 84.3±0.3 | 85.1±0.1 | 89.3±0.0
  \Delta_{\text{NDCG}}: 50.9±2.3 | 32.7±1.5 | 39.4±10.6
GNN (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 84.5±0.9 / 83.2±1.5 / 83.3±2.2 | 85.2±0.8 / 85.1±0.8 / 85.2±0.5 | 88.4±1.4 / 87.6±1.5 / 88.6±1.0
  NDCG(\mathcal{V}^{R}_{1}): 84.7±0.9 / 83.5±1.3 / 83.6±2.1 | 85.1±0.8 / 85.3±0.8 / 85.2±0.5 | 88.4±1.4 / 87.6±1.5 / 88.6±1.0
  \Delta_{\text{NDCG}}: 51.2±2.1 / 51.0±3.0 / 50.7±3.5 | 21.9±7.0 / 21.8±7.8 / 21.3±8.9 | 36.3±10.4 / 34.3±10.8 / 34.8±11.1
GNN-S_{1}^{\text{Tr}}* (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.6±0.7 / 85.8±0.5 / 85.6±0.8 | 85.8±0.1 / 85.9±0.0 / 85.9±0.0 | 89.7±0.2 / 89.7±0.1 / 89.6±0.1
  NDCG(\mathcal{V}^{R}_{1}): 85.6±0.7 / 85.8±0.5 / 85.7±0.8 | 85.9±0.0 / 86.0±0.0 / 86.0±0.0 | 89.7±0.2 / 89.7±0.1 / 89.6±0.1
  \Delta_{\text{NDCG}}: 51.6±0.9 / 50.3±1.0 / 50.1±1.0 | 19.1±5.5 / 19.0±5.2 / 17.9±6.1 | 38.7±7.2 / 36.0±9.0 / 34.3±11.5
GNN-S_{0}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.1±0.7 / 85.3±1.4 / 85.2±1.4 | 85.3±0.6 / 85.4±0.5 / 85.4±0.4 | 89.4±0.6 / 89.6±0.1 / 89.0±0.06
  NDCG(\mathcal{V}^{R}_{1}): 85.2±0.8 / 83.4±1.3 / 83.5±1.3 | 85.2±0.6 / 85.5±0.4 / 85.5±0.4 | 89.4±0.06 / 89.0±0.01 / 89.1±0.06
  \Delta_{\text{NDCG}}: 51.2±1.5 / 50.9±2.4 / 50.9±2.3 | 21.9±6.9 / 21.9±6.3 / 20.9±9.4 | 38.9±7.6 / 36.9±9.5 / 34.9±11.0
GNN-S_{1}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 84.7±1.3 / 83.7±0.9 / 83.1±0.9 | 85.7±0.1 / 85.8±0.1 / 85.8±0.2 | 89.6±0.3 / 89.6±0.1 / 89.5±0.3
  NDCG(\mathcal{V}^{R}_{1}): 84.8±1.3 / 83.9±0.9 / 83.4±0.9 | 85.8±0.1 / 85.8±0.1 / 85.8±0.1 | 89.6±0.3 / 89.6±0.1 / 89.5±0.3
  \Delta_{\text{NDCG}}: 51.3±0.6 / 50.8±0.6 / 50.2±1.0 | 21.0±5.4 / 19.8±5.4 / 19.3±6.8 | 38.7±7.2 / 36.2±9.5 / 34.6±11.4
Table 4. NDCG values for the outputs of detectors on Yelp datasets with the 15-th percentile of user degree as the cutoff for the groups of A.

Detector | Metric (%) | Chi | NYC | Zip
FairGNN (dai2021say)
  NDCG(\mathcal{V}^{R}): 86.2±0.2 | 85.9±0.1 | 89.9±0.0
  NDCG(\mathcal{V}^{R}_{1}): 86.3±0.2 | 86.0±0.1 | 90.0±0.0
  \Delta_{\text{NDCG}}: 45.8±1.7 | 20.0±2.4 | 31.0±2.7
EDITS (dong2022edits)
  NDCG(\mathcal{V}^{R}): 84.3±0.2 | 85.0±0.1 | 89.2±0.0
  NDCG(\mathcal{V}^{R}_{1}): 84.4±0.2 | 85.1±0.1 | 89.3±0.0
  \Delta_{\text{NDCG}}: 43.8±2.6 | 32.4±1.5 | 34.4±3.5
GNN (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 84.3±1.3 / 84.3±0.9 / 84.4±0.9 | 85.7±0.2 / 84.5±0.4 / 84.6±0.5 | 89.5±0.2 / 88.9±0.6 / 88.9±0.6
  NDCG(\mathcal{V}^{R}_{1}): 84.6±1.1 / 84.7±0.7 / 84.8±0.7 | 85.8±0.2 / 84.6±0.4 / 84.6±0.5 | 89.5±0.2 / 88.9±0.6 / 88.9±0.7
  \Delta_{\text{NDCG}}: 41.8±2.4 / 40.9±3.7 / 40.9±4.0 | 16.3±3.9 / 15.8±2.7 / 15.2±6.1 | 26.1±3.4 / 26.6±2.5 / 25.9±2.7
GNN-S_{1}^{\text{Tr}}* (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 86.0±0.4 / 85.9±0.3 / 85.9±0.2 | 85.9±0.1 / 85.9±0.1 / 85.9±0.1 | 89.7±0.1 / 89.9±0.0 / 87.9±2.1
  NDCG(\mathcal{V}^{R}_{1}): 86.0±0.4 / 86.0±0.3 / 86.0±0.2 | 85.9±0.1 / 85.9±0.1 / 85.9±0.1 | 89.8±0.1 / 89.9±0.0 / 87.9±2.1
  \Delta_{\text{NDCG}}: 43.2±2.0 / 41.9±5.8 / 40.5±5.1 | 16.5±3.8 / 15.6±2.9 / 15.6±3.6 | 26.4±3.3 / 27.0±3.2 / 22.7±4.2
GNN-S_{0}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 84.4±1.5 / 84.3±1.1 / 84.4±0.9 | 85.8±0.2 / 84.7±0.4 / 84.7±0.4 | 89.4±0.2 / 89.1±0.6 / 89.1±0.6
  NDCG(\mathcal{V}^{R}_{1}): 84.7±1.1 / 84.6±0.9 / 84.8±0. | 85.9±0.2 / 84.7±0.5 / 84.7±0.5 | 89.5±0.2 / 89.1±0.6 / 89.2±0.6
  \Delta_{\text{NDCG}}: 43.4±2.4 / 41.7±4.6 / 41.1±4.2 | 16.6±4.1 / 15.7±3.0 / 15.6±3.6 | 26.2±4.0 / 27.2±3.2 / 25.3±2.7
GNN-S_{1}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.9±0.5 / 85.8±0.3 / 85.9±0.3 | 85.8±0.1 / 85.4±0.3 / 85.5±0.3 | 89.8±0.1 / 89.9±0.1 / 89.9±0.1
  NDCG(\mathcal{V}^{R}_{1}): 85.9±0.5 / 85.9±0.3 / 85.9±0.3 | 85.9±0.2 / 85.5±0.3 / 85.5±0.3 | 89.8±0.1 / 89.9±0.1 / 89.9±0.1
  \Delta_{\text{NDCG}}: 42.6±2.8 / 40.7±3.7 / 40.7±3.7 | 16.0±3.9 / 15.5±3.0 / 15.6±3.3 | 26.1±3.5 / 27.1±2.4 / 24.9±2.7
Table 5. NDCG values for the outputs of detectors on Yelp datasets with the 10-th percentile of user degree as the cutoff for the groups of A.

Detector | Metric (%) | Chi | NYC | Zip
FairGNN (dai2021say)
  NDCG(\mathcal{V}^{R}): 86.2±0.2 | 85.9±0.1 | 89.9±0.0
  NDCG(\mathcal{V}^{R}_{1}): 86.4±0.2 | 86.0±0.1 | 90.0±0.0
  \Delta_{\text{NDCG}}: 33.8±5.4 | 22.4±1.9 | 24.8±1.6
EDITS (dong2022edits)
  NDCG(\mathcal{V}^{R}): 84.3±0.2 | 85.0±0.1 | 89.2±0.0
  NDCG(\mathcal{V}^{R}_{1}): 84.5±0.2 | 85.2±0.1 | 89.4±0.0
  \Delta_{\text{NDCG}}: 37.1±2.0 | 30.0±1.2 | 28.9±1.3
GNN (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.4±0.5 / 84.7±1.3 / 84.9±1.4 | 84.8±0.4 / 84.5±0.3 / 84.6±0.4 | 89.7±0.3 / 89.6±0.6 / 88.9±0.6
  NDCG(\mathcal{V}^{R}_{1}): 85.7±0.4 / 85.3±0.8 / 85.4±0.9 | 84.8±0.4 / 84.6±0.3 / 84.7±0.4 | 89.8±0.3 / 89.6±0.6 / 89.8±0.4
  \Delta_{\text{NDCG}}: 34.7±1.5 / 33.9±2.5 / 34.7±1.5 | 19.7±1.6 / 19.1±1.6 / 18.7±2.0 | 25.0±1.8 / 23.7±1.7 / 23.6±1.7
GNN-S_{1}^{\text{Tr}}* (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.9±0.4 / 86.1±0.2 / 86.1±0.2 | 85.9±0.1 / 85.9±0.1 / 85.9±0.1 | 89.7±0.1 / 89.9±0.0 / 89.9±0.0
  NDCG(\mathcal{V}^{R}_{1}): 85.9±0.4 / 86.2±0.2 / 86.2±0.2 | 86.0±0.1 / 86.0±0.1 / 86.0±0.1 | 89.8±0.1 / 90.0±0.0 / 90.0±0.0
  \Delta_{\text{NDCG}}: 34.1±4.6 / 33.7±3.0 / 33.2±2.6 | 19.1±1.9 / 16.8±1.5 / 16.6±1.9 | 25.1±1.6 / 23.4±1.6 / 23.3±1.4
GNN-S_{0}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.8±0.5 / 86.1±0.3 / 86.1±0.2 | 85.9±0.1 / 85.4±0.3 / 85.4±0.3 | 89.8±0.1 / 90.0±0.1 / 89.9±0.1
  NDCG(\mathcal{V}^{R}_{1}): 85.9±0.5 / 86.2±0.3 / 86.2±0.2 | 86.0±0.1 / 85.5±0.3 / 85.5±0.3 | 89.9±0.1 / 90.0±0.1 / 90.0±0.1
  \Delta_{\text{NDCG}}: 33.6±4.4 / 34.1±3.3 / 33.8±3.0 | 19.7±2.0 / 18.6±1.5 / 18.6±1.9 | 25.1±1.6 / 23.7±1.6 / 23.6±1.4
GNN-S_{1}^{\text{Te}} (w/o / Pre-trained / Joint*)
  NDCG(\mathcal{V}^{R}): 85.6±0.5 / 85.0±1.2 / 85.0±1.2 | 85.9±0.1 / 84.7±0.3 / 84.7±0.3 | 89.6±0.1 / 89.7±0.5 / 89.5±1.2
  NDCG(\mathcal{V}^{R}_{1}): 85.7±0.6 / 85.4±0.8 / 85.5±0.7 | 86.0±0.1 / 84.8±0.4 / 84.8±0.4 | 89.7±0.1 / 89.8±0.5 / 89.2±0.7
  \Delta_{\text{NDCG}}: 34.0±4.6 / 34.3±3.0 / 34.2±2.8 | 19.6±1.9 / 18.7±1.7 / 18.7±2.1 | 25.1±1.6 / 23.8±1.7 / 24.2±2.8

3.2. Results

3.2.1. Group Fairness.

To answer RQ1, we take the difference between the NDCGs of reviews from the groups A=0 and A=1 as the group fairness metric, denoted by \Delta_{\text{NDCG}}. Tables 3, 4, and 5 show the NDCG values for the outputs of the spam detectors with the 20-th, 15-th, and 10-th percentile of user node degree as the cutoff for the groups of A, respectively. Each table has rows for six detectors in two sections. FairGNN and EDITS in the upper section ignore the attribute A^{\prime} that we define. The remaining four detectors use A^{\prime}, where the sub-columns under each dataset column correspond to the methods for obtaining the value of A^{\prime}. These three tables report the NDCG of all test reviews, NDCG(\mathcal{V}^{R}); the NDCG of test reviews from the protected group, NDCG(\mathcal{V}^{R}_{1}); and \Delta_{\text{NDCG}}.
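For reference, a minimal binary-relevance NDCG and the resulting gap metric are sketched below; the paper's exact NDCG variant (e.g., any truncation depth) may differ, and the names are ours:

```python
import numpy as np

def ndcg(y_hat, y):
    """Binary-relevance NDCG: DCG of the predicted ranking divided by the DCG
    of the ideal ranking that places all spams (y = 1) on top."""
    discounts = 1.0 / np.log2(np.arange(2, y.size + 2))
    gains = y[np.argsort(-y_hat)]            # labels in predicted rank order
    return (gains * discounts).sum() / (np.sort(y)[::-1] * discounts).sum()

def delta_ndcg(y_hat, y, A):
    """Group fairness metric: gap between NDCG on protected (A = 1) and
    favored (A = 0) reviews."""
    return abs(ndcg(y_hat[A == 1], y[A == 1]) - ndcg(y_hat[A == 0], y[A == 0]))
```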

The large values of \Delta_{\text{NDCG}} for FairGNN, EDITS, and w/o+GNN show the pervasive unfairness of GNN-based models. FairGNN and EDITS have larger \Delta_{\text{NDCG}} when improving NDCGs for the favored groups, meaning they worsen fairness: the NDCGs increase more on the favored groups than on the protected groups. For every spam detector in the lower section, the proposed Joint method has the smallest \Delta_{\text{NDCG}} in most cases. Compared to Pre-trained, g_{\boldsymbol{\theta}} in Joint receives the additional gradient of \boldsymbol{\theta} from f_{\boldsymbol{W}} (see Eq. (LABEL:eq:_update_theta)). However, for detectors without the review and user augmentation (i.e., Joint+GNN), the additional gradient may lead g_{\boldsymbol{\theta}} to infer an A^{\prime} that hurts the performance of f_{\boldsymbol{W}}. For any method of obtaining the value of A^{\prime}, the detector with our augmentation GNN-S_{1}^{\text{Tr}} has the smallest \Delta_{\text{NDCG}} in almost all cases. Since Chi and Zip contain fewer mixed users, maintaining the original distribution while mixing up is more difficult on these datasets.

Refer to caption
Figure 3. Relation between the accuracy of predicting A^{\prime} and group fairness. x-axis: the gap in g_{\boldsymbol{\theta}}’s AUC in predicting A^{\prime} between Joint and Pre-trained; y-axis: the gap in \Delta_{\text{NDCG}} of f_{\boldsymbol{W}} in spam detection between Pre-trained and Joint. The Joint method simultaneously promotes the AUC of the predicted A^{\prime} and group fairness.
Refer to caption
Figure 4. Sensitivity analysis for the replication time k and pruning of non-spam edges. This figure shows the test AUCs of g_{\boldsymbol{\theta}} on graphs with various k, with and without pruning edges.

Explanation of improved group fairness. Fig. 5 shows the test AFRRs of the pure and mixed subgroups across four methods: w/o+GNN, Joint+GNN-S_{1}^{\text{Tr}}, Joint+GNN-S_{0}^{\text{Te}}, and Joint+GNN-S_{1}^{\text{Te}}. It demonstrates the effect of adding A^{\prime} on spams from the subgroups and explains the improved NDCG on the protected group. w/o+GNN shows that spams from mixed users have larger AFRRs than those from pure users on all datasets. In other words, inside the favored group, the basic GNN tends to rank spams from pure users higher than those from mixed users. By introducing A^{\prime} and applying the augmentation methods, the AFRR decreases for mixed users and sometimes for pure users. Our method (rightmost) improves the NDCG for the protected group mainly by raising the rank of spams from mixed users and sometimes also the spams from pure users.

Refer to caption: (a) 20-th Percentile of User Node Degree; (b) 15-th Percentile of User Node Degree; (c) 10-th Percentile of User Node Degree.

Figure 5. Box plot of AFRR in Eq. (5) for test mixed and pure users over all splits of the three datasets under four settings (the proposed method is in bold). A box records the set of AFRRs over all splits, and the solid lines connect the means of the boxes. Introducing A^{\prime} and the joint training of f_{\boldsymbol{W}} and g_{\boldsymbol{\theta}} (the Joint method) decreases the AFRR for mixed users by assigning larger suspiciousness to spams than to non-spams.

3.2.2. Evaluation of the Joint method on improving the quality of A^{\prime}.

To answer RQ2 and RQ3, we study the relationship between group fairness and the quality of the inferred A^{\prime}. Recall that in Tables 3, 4, and 5, the Joint method has a smaller \Delta_{\text{NDCG}} than Pre-trained in most cases. To understand the advantage of Joint, Fig. 3 shows the AUC gap of A^{\prime} given by g_{\boldsymbol{\theta}} (x-axis) in these two settings against the corresponding \Delta_{\text{NDCG}} difference (y-axis). Most models lying in region I indicate that Joint simultaneously promotes the accuracy of g_{\boldsymbol{\theta}} and the fairness of f_{\boldsymbol{W}}. Since Joint updates \boldsymbol{\theta} using the gradient from f_{\boldsymbol{W}} (see Eq. (LABEL:eq:_update_theta)), our mixup strategies can mitigate overfitting of g_{\boldsymbol{\theta}} with more gradients from the synthetic data.
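Algorithm LABEL:alg:_joint_train is not reproduced in this section, so the PyTorch fragment below is only a schematic of the coupling it describes: both objectives are backpropagated in one step, so \boldsymbol{\theta} also receives gradients from the detector's loss. The stand-in linear models and the particular way the detector consumes the soft A^{\prime} are our assumptions, not the paper's architecture:

```python
import torch
import torch.nn.functional as F

f_W = torch.nn.Linear(16, 1)      # stand-in for the detector GNN f_W
g_theta = torch.nn.Linear(16, 1)  # stand-in for the subgroup GNN g_theta
opt = torch.optim.Adam(list(f_W.parameters()) + list(g_theta.parameters()), lr=1e-3)

def joint_step(x, y, a_prime_labels):
    """One joint update: g_theta infers a probabilistic A'; the detector uses
    it directly (no thresholding); summing the two losses lets theta receive
    gradients from the detector's objective. Expects float tensors of shape
    (batch, 16) for x and (batch, 1) for y and a_prime_labels."""
    a_prime_prob = torch.sigmoid(g_theta(x))       # inferred soft A'
    logits = f_W(x) + a_prime_prob                 # hypothetical use of A' by f_W
    loss = F.binary_cross_entropy_with_logits(logits, y) \
         + F.binary_cross_entropy(a_prime_prob, a_prime_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)
```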

Refer to caption: (a) 20-th Percentile of User Node Degree; (b) 15-th Percentile of User Node Degree; (c) 10-th Percentile of User Node Degree.

Figure 6. Test \Delta_{\text{NDCG}} for four detectors: GNN, GNN-S_{1}^{\text{Tr}} (dashed line; our method), GNN-S_{0}^{\text{Te}}, and GNN-S_{1}^{\text{Te}}, given A^{\prime} obtained from five methods. The amount of noise in A^{\prime} decreases from left to right: w/o: without A^{\prime}; Random: randomly set A^{\prime}=1/0; Pre-trained: output of the pre-trained g_{\boldsymbol{\theta}}; Joint: output of the jointly trained g_{\boldsymbol{\theta}}; GT: ground truth of A^{\prime}.

3.2.3. Impact of an accurate A^{\prime} on group fairness.

Since the quality of A^{\prime} correlates with group fairness, we further increase (the Random method) or decrease (the GT method) the noise in A^{\prime} to see whether this correlation holds, answering RQ4. Fig. 6 gives \Delta_{\text{NDCG}} for detectors with different ways of assigning values to A^{\prime}, which contain less and less noise moving from left to right on the x-axis. \Delta_{\text{NDCG}} decreases as a detector receives a more accurate inference of the values of A^{\prime}.

3.3. Sensitivity studies on the replication time k and pruning of non-spam edges.

Fig. 4 shows the test AUCs of g_{\boldsymbol{\theta}} with replication times k\in\{50,100\}, with and without pruning non-spam edges (see Section LABEL:sec:minor_user_augmentation). Pruning yields better AUCs than no pruning, except when k=100 on Zip. Additionally, the AUCs for Chi and Zip decrease as k increases. Since Chi and Zip have few mixed users, forcing the synthesized data to mimic the original node distribution is difficult and easily causes g_{\boldsymbol{\theta}} to overfit. Based on the validation set, we obtain the optimal k=100 for NYC and k=50 for Chi and Zip.
