This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Multi-attributed Community Search in Road-social Networks

Fangda Guo, Ye Yuan§, Guoren Wang§, Xiangguo Zhao, Hao Sun School of Computer Science and Engineering, Northeastern University, Shenyang, China
§School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{fangda@stu, §yuanye@, zhaoxiangguo@, andynosh@stu}mail.neu.edu.cn,§[email protected]
Abstract

Given a location-based social network, how to find the communities that are highly relevant to query users and have top overall scores in multiple attributes according to user preferences? Typically, in the face of such a problem setting, we can model the network as a multi-attributed road-social network, in which each user is linked with location information and dd (1\geq\!1) numerical attributes. In practice, user preferences (i.e., weights) are usually inherently uncertain and can only be estimated with bounded accuracy, because a human user is not able to designate exact values with absolute precision. Inspired by this, we introduce a normative community model suitable for multi-criteria decision making, called multi-attributed community (MAC), based on the concepts of kk-core and a novel dominance relationship specific to preferences. Given uncertain user preferences, namely, an approximate representation of weights, the MAC search reports the exact communities for each of the possible weight settings. We devise an elegant index structure to maintain the dominance relationships, based on which two algorithms are developed to efficiently compute the top-jj MACs. The efficiency and scalability of our algorithms and the effectiveness of MAC model are demonstrated by extensive experiments on both real-world and synthetic road-social networks.

I Introduction

Numerous real-world networks (e.g., social networks) are made up of community structures, where discovering them is an essential problem in network analysis. Recently, community search, which is a kind of query-dependent community discovery problem and is designed to find densely connected subgraphs containing query vertices, has drawn much attention among database professionals due to an ever-growing number of applications [1, 2, 3, 4, 5, 6, 7, 8].

At the same time, with the prevalence of GPS-enabled mobile devices, location-based social networks (LBSN) are becoming more diverse and complex in recent years (e.g., Facebook Places, Foursquare). Since not only users and friendships are included, but each user is often associated with various properties, such as location information and numerical attributes. The location information is able to bridge the gap between virtual and physical worlds, while the numerical attributes obtained from user profiles or statistical information derived by various network analytics (e.g., influence, similarity, etc.) can characterize the user. For instance, in a scientific collaboration network such as Aminer, every author may have own (spatial) position/address and several numerical attributes (e.g., h-index, #publications, activeness, diverseness, etc.). Typically, we can model such network data as a multi-attributed road-social network, in which each vertex is linked with location and dd (1\geq\!1) numerical attributes.

Given a multi-attributed road-social network, how to identify the query-user-involved communities that are not dominated by the other communities according to user’s preferences for dd numerical attributes? For instance, considering h-index and activeness in the Aminer network, how to find a group of collaborators who are related to and close to the query users and take the activeness in their research fields in recent years as the main criterion? Similarly, considering #publications and diverseness, how to find a community of query users in the Aminer network such that the members provide a good tradeoff in terms of the number of publications and diverse research interests? It is noting that both the social and spatial cohesiveness of groups are incorporated.

In this regard, we want to develop a normative community model suitable for multi-criteria decision making. Considering that the traditional top-jj query receives a dataset of records with dd-dimensional attributes and a weight vector ww of dd values assigning the relative importance of each dimension to the user as input, where the weighted sum of attribute values is used as the score of a record, we evolve such scoring method into a multi-attributed community model. Similarly, the weight vector ww denotes the user preferences and is therefore crucial in generating useful recommendations for the community. Generally, ww can be directly input by the user or mined from his/her past behavior or choices [9, 10].

Driven by the fact that weight accuracy plays a vital role in the applicability and practicality of top-jj queries, we argue that the assumption that exact values of a weight vector are known is inherently inaccurate and almost unrealistic. To illustrate this, consider the case of manually assigning weights to the above example. By taking activeness as the main criterion, a user may specify weights 0.2 and 0.8 for h-index and activeness, respectively. At this point, leaving aside the participation of query users and spatial cohesiveness, the weighted sum of these two attribute values can be regarded as the influence (or importance) of the vertex in [4]. On the other hand, based on pure intuition, she may also specify the same weight per dimension (e.g., 0.5 each). The results may be similar to the skyline community [8], since weights are not biased towards any dimension (as verified in our experiments, the skyline community [8] is usually contained in our results). However, it is impractical and unfair for the user to require weights with absolute precision even a slight variation in the weight (e.g., from 0.2 to 0.19; see Example 3) may remarkably change the results. On the contrary, it would be preferable to take user inputs as generic instructions and leave room for inaccuracies in weight setting. Similarly, for the weight vector computed by preference learning techniques, it should serve only as a rough guide instead of an accurate expression of user preferences. Naturally, this issue can be dealt with by expanding the weight vector to a region111There are already preference learning techniques (e.g., [11]) to generate such a region instead of a specific weight vector. and returning all promising results to the user, thus providing a practical and more user-centric design.

Prior work on community search problem has never incorporated the uncertainty of weight vector, thus all existing community search algorithms are unable to answer the above questions. To adequately characterize such interesting communities w.r.t. user preferences, we propose a novel community model called multi-attributed community (MAC) based on the concepts of kk-core [12] and r-dominance (variant of traditional sense) [13]. An MAC is a maximal connected kk-core with query vertices contained and spatial cohesiveness satisfied, that is not r-dominated by other connected kk-cores in terms of dd-dimensional attributes w.r.t a region of interest RR (see Section II for detailed definition). Importantly, since the scores of communities all depend on ww, they are necessarily correlated and vary together as ww freely lies inside region RR. As a result, a partitioning of RR forms the output, in which each partition is associated with the MACs when ww falls anywhere in that partition. The MAC model is also applicable to many interesting applications, some of which are introduced below.

Personalized optimum community search. In daily life, people always want to discover the optimum community based on different needs. For example, a coach hopes to reorganize the school basketball team around certain players (as query users) to improve offense. In this application, we can limit the query scope to the school and extract three numerical attributes for each player: points, rebounds and assists. By setting the region of user preferences, we can obtain corresponding communities that are not r-dominated by the others. Similar query may also help organizations to analyze customer orientation or perform marketing/promotion activities.

Cohesive groups discovery in LBSNs. In some cases, one may wish to circle the target range by finding cohesive groups to achieve identification, such as COVID-19 precaution and suspect investigation. Given several confirmed cases, possible cases are likely to be within a certain range of them (providing possibility of close contact), and the Jaccard similarity (e.g., hobbies, interests) and influence (e.g., #neighbors) for each user can be extracted as numerical attributes; for the preliminary investigation of a case, given the victim and escape range, motive and #criminal records of the same or different types can be used as numerical attributes. By mining the MACs, we can get the desired cohesive groups related to query users.

Contributions. Efficient solutions are formulated and provided to find multi-attributed communities in road-social networks. Below, we summarize the contributions of this paper.

Novel community model. The MAC model is proposed in road-social networks, which can be used for finding communities not r-dominated by others. To the best of our knowledge, our model is the first to incorporate uncertainty of weight vectors, and our work is also the first to introduce the dominance relationship specific to preferences for community modeling.

New algorithms. An efficient global search, DFS-based algorithm, is first developed to find the top-jj MACs for each of the possible weight settings in user preferences. The time and space complexity of DFS-based algorithm are bounded by O(n2d)O(n^{\prime 2d}) and O(m+n+n2d)O(m^{\prime}\!+\!n^{\prime}\!+\!n^{\prime 2}\!\cdot\!d) respectively, where nn^{\prime} and mm^{\prime} denote the number of vertices and edges in the maximal (k,t)(k,t)-core of a road-social network. To further accelerate the search, we propose a more efficient local search framework, of which two striking features are that (1) its time complexity is much lower than that of global search (at least one order of magnitude faster in practice), and it can find all non-contained MACs in most cases; and (2) it enables the MACs to be output progressively during execution, which is very beneficial to applications expecting only part of the MACs.

Extensive experiments. To demonstrate the high efficiency and scalability of our proposed algorithms and to evaluate the effectiveness of the MAC model, we conduct extensive experiments and a comprehensive case study on both real-world and synthetic road-social networks, in which many significant and interesting communities are able to be discovered.

II Problem Statement

In this section, we formally introduce the multi-attributed community in road-social networks and its search problems. Table I summarizes the notations used throughout this paper.

II-A Preliminaries

Road network. We model a road network as an undirected weighted graph Gr=(Vr,Er)G_{r}\!=\!(V_{r},E_{r}), where VrV_{r} (resp. ErE_{r}) is the sets of vertices (resp. edges). A vertex uVru\!\in\!V_{r} represents a road intersection/end. An edge (u,v)Er(u,v)\!\in\!E_{r} represents a road segment allowed to travel between vertices uu and vv, and is associated with a non-negative weight ω(u,v)\omega(u,v) that represents its cost (e.g., distance/travel time). Let pp be a spatial point lying on edge (u,v)(u,v), and ω(u,p)\omega(u,p) be proportional to the length from uu to pp. By dist(p,p)dist(p,p^{\prime}) we refer to the network distance between points pp and pp^{\prime} in GrG_{r}, which is the sum of edge weights along the least costly (i.e., shortest) path from pp to pp^{\prime}.

Refer to caption
(a) Social network (GsG_{s})
Refer to caption
(b) Road network (GrG_{r})
Figure 1: Example of road-social network

Social network. We model a social network with dd numerical attributes as an undirected graph Gs=(Vs,Es,L,X)G_{s}\!=\!(V_{s},E_{s},L,X), where Vs(|Vs|=n)V_{s}(|V_{s}|=n) and Es(|Es|=m)E_{s}(|E_{s}|=m) denote the sets of vertices (i.e., users) and edges (i.e., social relations) respectively, LL and X(|X|=n)X(|X|=n) are the sets of mappings and dd-dimensional vectors defined on VsV_{s} respectively. Specifically, for each vertex vVsv\in V_{s}, L(v)L(v) provides a mapping of each user’s location (i.e., spatial point) in the road network; and vv is also associated with a dd-dimensional vector with real values, denoted by X(v)=(x1v,,xdv)X(v)\!=\!(x_{1}^{v},\ldots,x_{d}^{v}), where X(v)XX(v)\!\in\!X and xivx_{i}^{v}\!\in\!\mathbb{R}.

Road-social network. A road-social network is a pair of graphs, denoted by (Gr,Gs)(G_{r},G_{s}), where GrG_{r} is a road network and GsG_{s} is a social network. Each user usGsu_{s}\!\in\!G_{s} is associated with a spatial point pp in GrG_{r}, i.e., L(us)=pL(u_{s})\!=\!p, indicating the current location of usu_{s}. We assume that pp can either be on a vertex or edge of GrG_{r}.

TABLE I: Summary of Notations
Notation Description
Q,jQ,j query vertices and number of MACs to select among
k,tk,t coreness threshold and query distance threshold
d,Rd,R dimensionality of attributes and preference domain
dgH(v),δ(H)dg_{H}(v),\delta(H) degree of vertex vv in HH and minimum degree of HH
N(v,H)N(v,H) neighbor set of vertex vv in subgraph HH
L(v)L(v) mapping of user vv’s location (i.e., spatial point) in GrG_{r}
X(v)X(v) dd-dimensional vector with real values of user vv
DQ(H),S(H)D_{Q}(H),S(H) query distance and score of subgraph HH resp.
Hkt,GdH_{k}^{t},G_{d} the maximal (k,t)(k,t)-core and r-dominance graph
Ge,GcG_{e},G_{c} subgraphs of GdG_{d} induced by VHV_{H} and VGd\VHV_{G_{d}}\!\backslash\!V_{H} resp.
lb(Ge),lt(Gc)l_{b}(G_{e}),l_{t}(G_{c}) vertices in the bottom layer of GeG_{e} and top layer of GcG_{c}

For example, Fig. 1 displays a road-social network, where vertices represent users and road junctions, and edges represent social relations and road segments, respectively. For simplicity, the location of user viv_{i} in Fig. 1(a) is on the vertex rir_{i} of the road network in Fig. 1(b). Fig. 2(a) shows the values of part of vertices in three different dimensions.

II-B (k, t)-Core

We introduce a novel densely connected substructure, called (k,t)(k,t)-core, by focusing on the following structural cohesiveness and communication cost.

Structural cohesiveness. In order to model the structural cohesiveness, the generally used kk-core model is adopted to indicate the communities [12, 14, 1, 2, 4]. In particular, by dgGs(v)dg_{G_{s}}(v) we refer to the degree of vertex vv in social network GsG_{s}. Let H=(VH,EH,LH,XH)H\!=\!(V_{H},E_{H},L_{H},X_{H}) be an induced subgraph of GsG_{s}, where VHVsV_{H}\!\subseteq\!V_{s}, EH={(u,v)|u,vVH,(u,v)Es}E_{H}\!=\!\{(u,v)|u,v\!\in\!V_{H},(u,v)\!\in\!E_{s}\}, LH={L(v)|vVH}L_{H}\!=\!\{L(v)|v\!\in\!V_{H}\} and XH={X(v)|vVH}X_{H}\!=\!\{X(v)|v\!\in\!V_{H}\}.

Definition 1

(kk-core.) Given a graph GsG_{s} and an integer kk, HH is a kk-core of GsG_{s} if each vertex vVHv\!\in\!V_{H} has a degree at least kk, i.e., dgH(v)kdg_{H}(v)\!\geq\!k.

The maximal kk-core is the one satisfying that no super kk-core containing it exists. We refer to the maximal kk in all kk-cores containing vertex vVsv\!\in\!V_{s} as the core number of vv. In order to avoid confusion, we denote a connected kk-core as a kk-core^\widehat{core}, since the maximal kk-core is not necessarily connected.

Remarks. Although we use kk-core as the structural cohesiveness metric, our techniques can also be applied to other criteria such as kk-clique [15] and kk-truss [3].

Communication cost. For two users u,vGsu,v\in G_{s}, the length of the shortest path between their locations in GrG_{r} is denoted by dist(L(u),L(v))dist(L(u),L(v)), which is equal to ++\infty if L(u)L(u) and L(v)L(v) are not connected. To model the communication cost in GrG_{r}, we utilize the notion of query distance below.

Definition 2

(Query Distance.) Given a graph HH and query vertices QVHQ\!\subseteq\!V_{H}, qQ\forall q\!\in\!Q, the query distance of vVHv\!\in\!V_{H} is the maximum length of the shortest path from LH(v)L_{H}(v) to LH(q)L_{H}(q) in GrG_{r}, denoted by DQ(v)=maxqQdist(L(v),L(q))D_{Q}(v)\!=\!\max_{q\in Q}dist(L(v),L(q)); the query distance of HH is defined as DQ(H)=maxuVHDQ(u)=maxuVH,qQdist(L(u),L(q))D_{Q}(H)\!=\!\max_{u\in V_{H}}D_{Q}(u)\!=\!\max_{u\in V_{H},q\in Q}dist(L(u),L(q)).

By query distance DQ(H)D_{Q}(H), the communication cost between query vertices QQ and the members in HH can be measured. In general, a good community is considered to own a low communication cost, i.e., small DQ(H)D_{Q}(H). Consider the query vertices Q={v2,v3,v6}Q\!=\!\{v_{2},v_{3},v_{6}\} in Fig. 1. The query distance of v7v_{7} is DQ(v7)=dist(r7,r6)=7D_{Q}(v_{7})\!=\!dist(r_{7},r_{6})\!=\!7. The query distance of the subgraph induced by {v2,v3,v6,v7}\{v_{2},v_{3},v_{6},v_{7}\} is equal to dist(r3,r6)=9dist(r_{3},r_{6})\!=\!9. In the following, we propose a new notion of (k,t)(k,t)-core, by adapting the concepts of kk-core and query distance, to capture dense structural cohesiveness and low communication cost.

Definition 3

((k,t)(k,t)-core.) Given graphs (Gr,Gs)(G_{r},G_{s}), query vertices QQ, and numbers kk and tt, HH is a (k,t)(k,t)-core iff HH is a kk-core^\widehat{core} of GsG_{s} containing QQ and DQ(H)tD_{Q}(H)\!\leq\!t in GrG_{r}.

For a (k,t)(k,t)-core, its structural cohesiveness increases with kk, while proximity to query vertices decreases with tt. For instance, in Fig. 1, the subgraph induced by {v2,v3,v6,v7}\{v_{2},v_{3},v_{6},v_{7}\} for Q={v2,v3,v6}Q\!=\!\{v_{2},v_{3},v_{6}\} is a (k,t)(k,t)-core with k=3k\!=\!3 and t=9t\!=\!9.

II-C Score of Multi-Attributes

Note that generalizing the existing community models, most of which are only for 11-dimensional attribute such as influence [4], Euclidean distance [6] or keyword similarity [5, 7, 16, 17], to a comparable one with multi-dimensional attributes is nontrivial. Unlike the above, comparing two communities becomes quite tough if either can have dd (d>1d\!>\!1) values due to dd different dimensions. On the other hand, existing multi-dimensional model [8] cannot compare the pros and cons of any skyline communities and make trade-offs based on user preferences. To overcome the above issues, we introduce the r-dominance relationship between two communities, that will be developed for defining our multi-attributed community model.

As each vertex vVsv\!\in\!V_{s} associated with a vector X(v)X(v), the attributes define a dd-dimensional data domain. We assume that for each attribute a higher value is preferable, and a spatial index (e.g., R-tree [18]) is used to organize the vector set XX.

Referring to the traditional top-jj queries, the score of a vertex vv w.r.t X(v)X(v) can also be derived by inputting a weight vector w=(w1,w2,,wd)w\!=\!(w_{1},w_{2},\ldots,w_{d}) as

S(v)=i=1dwixiv.S(v)=\sum\nolimits_{i=1}^{d}w_{i}\cdot x_{i}^{v}. (1)

Thus, the top-jj results consist of the jj vertices with the highest scores. We assume w.l.o.g. that wi(0,1)w_{i}\!\in\!(0,1) for i[1,d]\forall i\!\in\![1,d] and i=1dwi=1\sum_{i=1}^{d}w_{i}\!=\!1. Such conditions cannot restrain user preferences, because score ranking depends only on the direction of ww instead of its magnitude [19]; but they allow ww to drop one weight (i.e., wd=1i=1d1wiw_{d}\!=\!1-\sum_{i=1}^{d-1}w_{i}), thereby mapping the domain of ww to a (d1d\!-\!1)-dimensional space, named the preference domain. This dimensionality reduction is critical [20], since the dimensionality directly determines the processing time of the costliest operations in our techniques (see Section V-B). In the following, ww refers to the (d1d\!-\!1)-dimensional form of the weight vector. For example, given weights 0.2 and 0.3 for x1x_{1} and x2x_{2} respectively in Fig. 2(a), we have w3=0.5w_{3}\!=\!0.5, and S(v7)=4.47S(v_{7})\!=\!4.47.

Refer to caption
(a) 33-dimensional vectors
Refer to caption
(b) Pref. domain and MACs
Figure 2: Numerical attributes and MACs in RR

Community score. Let H=(VH,EH,LH,XH)H\!=\!(V_{H},E_{H},L_{H},X_{H}) be an induced subgraph of GsG_{s}. Given a weight vector ww, we define score of HH as

S(H)=minvVH{S(v)}.S(H)=\text{min}_{v\in V_{H}}\{S(v)\}. (2)

Here, we have a brief discussion on why the “min” operator in Eq. 2 is used to define S(H)S(H). The intention is the same as that in [4]. By using it, all the members in HH can be ensured to have a score in terms of dd dimensions no less than S(H)S(H). In other words, if S(H)S(H) is large, vertices in HH may not have large values in each dimension but the weighted sum of their attribute values must be large. For instance, consider the previous case of taking the activeness as the main criterion (e.g., weight 0.8) in the collaboration network. Obviously, we can apply the S(H)S(H) defined above to quantify the overall quality focusing on the activeness of a group of collaborators.

Definition 4

(r-dominance.) Given a region RR in the preference domain, a community HH r-dominates another community HH^{\prime} when S(H)S(H)S(H)\!\geq\!S(H^{\prime}) for any weight vector in RR, denoted by HHH\!\succ\!H^{\prime}; and vvv\!\succ\!v^{\prime} denotes that vertex vv r-dominates another vertex vv^{\prime}.

Intuitively, in the traditional sense of dominance [21, 22], for any weight vector ww the dominator is always superior to the dominee, while r-dominance is specific to weight vectors in RR. Although a community may not dominate another based on the skyline community model [8], it might always have a higher score as ww is bounded in RR. For ease of representation, we assume that RR is a hyper-rectangle parallel to axes, but our techniques is directly applicable to general convex polytopes.

For example, Fig. 2(b) illustrates a preference domain (i.e., the domain of weight vector) with d=3d\!=\!3, where axis w1w_{1} (resp. w2w_{2}) corresponds to the weight for x1x_{1} (resp. x2x_{2}) on a scale of 0 to 1, and the region RR is a convex polygon (i.e., an axis-parallel rectangle [0.1,0.5]×[0.2,0.4][0.1,0.5]\!\times\![0.2,0.4]) representing the expanded or approximated user preferences.

II-D The MAC Problem

Combining the notion of (k,t)(k,t)-core and community score, we define the multi-attributed community (MAC) as follows.

Definition 5

(MAC.) Given graphs (Gr,Gs)(G_{r},G_{s}), query vertices QVsQ\subseteq V_{s}, two numbers kk and tt, and a region of interest RR, HH is a multi-attributed community in the road-social network, if HH satisfies the following conditions:

  • 1.

    HH is a (k,t)(k,t)-core containing QQ.

  • 2.

    There does not exist an induced subgraph HH^{\prime} (VHVH)(V_{H^{\prime}}\supset V_{H}) such that HH^{\prime} is a (k,t)(k,t)-core and HHH^{\prime}\succ H.

In terms of structural cohesiveness and communication cost, condition (1) requires not only that the community with query vertices QQ contained is densely connected, but also that each vertex is spatially close to QQ. In terms of maximality and multi-attributed community score, condition (2) ensures that the community is maximal and having the greatest score. The following example illustrates Definition 5.

Example 1

Consider (Gr,Gs)(G_{r},G_{s}), numerical attributes and a region RR in Fig. 1 and 2, respectively. Suppose, for instance, that Q={v2}Q\!=\!\{v_{2}\}, k=2k\!=\!2 and t=9t\!=\!9. By Definition 5, the subgraph induced by {v2,v3,v5,v6,v7}\{v_{2},v_{3},v_{5},v_{6},v_{7}\} is an MAC with community score equal to S(v7)S(v_{7}) if ww freely lies anywhere in the upper-left part of R1R_{1} (divided by dotted line), as it meets both conditions. Note that the subgraph induced by {v2,v3,v5,v7}\{v_{2},v_{3},v_{5},v_{7}\} is not an MAC. This is because it is contained in the MAC induced by {v2,v3,v5,v6,v7}\{v_{2},v_{3},v_{5},v_{6},v_{7}\} with the same community score, thus fails to satisfy condition (2). \hfill{}\Box

For most practical applications, we typically tend to focus on the query-user-involved communities, which score higher than (i.e., r-dominate) all other communities in the preference domain RR. In this paper, we aim to efficiently discover such communities in road-social networks. Below, two multi-attributed community search problems are formulated.

Problem 1. Given graphs (Gr,Gs)(G_{r},G_{s}), query vertices QVsQ\subseteq V_{s}, two numbers kk and tt, and a region of interest RR, the problem is to find the top-jj MACs with the highest community score for each possible weight vector in RR. Although the possible weight vectors are infinite in RR, a partitioning of RR can form the output, in which each partition is associated with the top-jj MACs when ww falls anywhere in that partition.

For Problem 1, an MAC may be contained in another MAC in the top-jj results.

Example 2

Assume that Q={v2,v3,v6}Q\!=\!\{v_{2},v_{3},v_{6}\}, k=3k\!=\!3 and t=9t\!=\!9 in Fig. 1. The top-22 MACs are subgraphs H1H_{1} and H2H_{2} induced by {v2,v3,v6,v7}\{v_{2},v_{3},v_{6},v_{7}\} and {v2,,v7}\{v_{2},\ldots,v_{7}\} for any weight vector in R1R_{1}, respectively. S(H1)=S(v7)S(H_{1})\!=\!S(v_{7}), and S(H2)=S(v4)S(H_{2})\!=\!S(v_{4}) or S(v5)S(v_{5}) on either side of the dotted line. Clearly, H2H_{2} contains H1H_{1}. \hfill{}\Box

To eliminate the containment relations in the top-jj results, we study another problem of finding the non-contained MAC.

Definition 6

(non-contained MAC.) An MAC HH is a non-contained MAC if there does not exist an induced subgraph HH^{\prime} (VHVH)(V_{H^{\prime}}\subset V_{H}) such that HH^{\prime} is a (k,t)(k,t)-core and HHH^{\prime}\succ H.

The following example illustrates Definition 6.

Example 3

Let us reconsider Example 2. By Definition 6, subgraphs H1H_{1} and H3H_{3} (induced by {v2,,v6}\{v_{2},\ldots,v_{6}\}) are the non-contained MACs for any weight vector in R1R_{1} and in R2R3R_{2}\cup R_{3}, respectively. For instance, H3H_{3} (resp. H1H_{1}) is the top-11 result when w=(0.2,0.3)w\!=\!(0.2,0.3) (resp. w=(0.19,0.3)w\!=\!(0.19,0.3)). However, H2H_{2} is not a non-contained MAC as it contains H1H_{1} and H1H2H_{1}\succ H_{2} for any weight vector in R1R_{1}. \hfill{}\Box

Problem 2. Given graphs (Gr,Gs)(G_{r},G_{s}), query vertices QVsQ\subseteq V_{s}, two numbers kk and tt, and a region of interest RR, the problem is to find the non-contained MAC for every possible weight vector in its corresponding partitioning of RR.

Discussions. Another three possible operators “max”, “sum” and “avg” are not appropriate for community score. The first two are monotonic w.r.t. the size of community, that is, a community scores higher than its sub-communities. Hence, the answer is always the maximal (k,t)(k,t)-core, which is independent of numerical attributes XX and region RR. The last one may cause outliers in the answer, e.g., only a few vertices score very high while the rest score low, resulting in a higher community score. Obviously, this is not an ideal community.

Challenges. Solving the above two problems faces three major challenges. First, the number of kk-core^\widehat{core}s containing QQ in a multi-attributed network GsG_{s} can be exponentially large (even regardless of the query distance in GrG_{r}). Thus, enumerating all the kk-core^\widehat{core}s to identify the MACs is intractable. Second, unlike traditional top-jj queries [23], the score of a community may vary greatly at different parts of RR, making it nontrivial to draw conclusions about r-dominance relationships between communities. Third, the MAC model enables a more flexible way to express user preferences in community search problem, which means that inherent inaccuracies in weight specification need to be taken into account. In consequence, without enumerating all the (k,t)(k,t)-cores, devising an efficient algorithm to detect the MACs is challenging. To overcome these challenges, we will develop efficient algorithms in the following sections.

III Warming Up for Our Methods

According to Definition 5, regardless of maximality and community score, the multi-attributed communities have to satisfy the constraints of structural cohesiveness and communication cost. Thus, we give two useful lemmas as follows.

Lemma 1

For a number tt, the vertices of GsG_{s} with query distance greater than tt in GrG_{r} cannot exist in any MAC.

Lemma 2

For an integer kk, the MACs must be contained in the maximal kk-core^\widehat{core} containing QQ.

The correctness of above lemmas can be verified by Definition 2 and the maximality of kk-core, resulting in Lemma 3.

Definition 7

(Maximal (k,t)(k,t)-core.) For graphs (Gr,Gs)(G_{r},G_{s}) and query vertices QQ, the maximal (k,t)(k,t)-core is a (k,t)(k,t)-core such that no super (k,t)(k,t)-core contains it, denoted by HktH_{k}^{t}.

Lemma 3

For two numbers kk and tt, the MACs must be contained in the maximal (k,t)(k,t)-core.

Refer to caption
(a) vv r-dominates
Refer to caption
(b) r-incomparable
Refer to caption
(c) vv r-dominated
Figure 3: Cases of r-dominance for vertices vv and vv^{\prime}

Referring to Lemma 3, for a given kk and tt, we first filter out the vertices of GsG_{s} that do not satisfy the query distance threshold tt by range query in GrG_{r}, which can be accelerated by G-tree [24] or G*-tree [25], to obtain the induced connected subgraph GsG_{s}^{\prime} by the remaining vertices. Next, we do kk-core decomposition [14] on the filtered social subgraph GsG_{s}^{\prime} to compute the maximal kk-core^\widehat{core} containing QQ. It is noting that we employ the upper bound of coreness [2] before core decomposition. If kk is larger than 1+9+8(|Es||Vs|)2\lfloor\frac{1+\sqrt{9+8(|E_{s}^{\prime}|-|V_{s}^{\prime}|)}}{2}\rfloor, we immediately know there is no kk-core^\widehat{core} w.r.t QQ. So far, the maximal (k,t)(k,t)-core has been found such that the MACs can be obtained through in-depth computation. For example, the maximal (3,9)(3,9)-core, i.e., H39H_{3}^{9}, for Q={v2,v3,v6}Q\!=\!\{v_{2},v_{3},v_{6}\} is the subgraph induced by {v1,,v7}\{v_{1},\ldots,v_{7}\}, as shown in Fig. 4(a).

After abandoning the scheme of enumerating all the (k,t)(k,t)-cores whose number can be exponentially large even in HktH_{k}^{t}, intuitively, we may think of iteratively deleting the smallest-score vertex w.r.t. dd-dimensional attributes until the resulting graph does not have a kk-core^\widehat{core} containing QQ. However, at this time we will face another problem, that is, which vertex has the smallest score? To address this issue, we design an effective data structure and construction algorithm to preserve r-dominance relationships between vertices.

IV R-Dominance Graph

In this section, we exploit the r-dominance graph to preserve pair-wise r-dominance relationships between vertices in HktH_{k}^{t}, which will be used in our proposed search algorithms.

IV-A R-Dominance Test

Consider two vertices vv and vv^{\prime} where none dominates the other in terms of traditional dominance, that may not draw a reliable conclusion about which vertex ranks higher. Nevertheless, given a preference domain, the inequation S(v)S(v)S(v)\!\geq\!S(v^{\prime}) (resp. equation S(v)=S(v)S(v)\!=\!S(v^{\prime})) corresponds to a half-space (resp. hyperplane), of which there are three different cases regarding the positioning against RR [13]. Specifically, in Fig. 3(a), vv r-dominates vv^{\prime} since half-space HS:S(v)S(v)HS:S(v)\!\geq\!S(v^{\prime}) completely covers RR, which means vv scores higher for wR\forall w\in R; the case in Fig. 3(c) is symmetric. In Fig. 3(b), vv scores higher in one part of RR but lower in another, which is called r-incomparable as none r-dominates the other.

Clearly, the cases in Fig. 3(a) and 3(c) allow r-dominance conclusions to be safely drawn. In this way, we can determine r-dominance by detecting whether all polygon vertices defining RR fall into the half-space HS:S(v)S(v)HS:S(v)\!\geq\!S(v^{\prime}). If so (resp. not), vv r-dominates (resp. is r-dominated by) vv^{\prime}. Otherwise, vv and vv^{\prime} are r-incomparable. The inclusion detecting of each polygon vertex costs O(d)O(d), so the r-dominance test requires O(pd)O(pd) in total, where pp is the number of polygon vertices defining RR.

IV-B Pair-Wise R-Dominance Relationship

The computation of r-dominance relationships is somewhat similar to that of the kk-skyband (i.e., BBS [26]), but differs as follows. (1) Rather than traditional dominance, we employ r-dominance and apply its test described in Section IV-A both for vertex-to-vertex and vertex-to-MBB (i.e., minimum bounding box) dominance testing. (2) Due to the fact that ww is bounded in RR, we adopt a unique sorting key for R-tree nodes (represented by MBB’s upper-right corner) and vertices in the heap to accelerate search convergence by leading it to r-dominate as many members as possible first. (3) We preserve the r-dominance relationships between vertices in HktH_{k}^{t} instead of only the top-jj layers. It is noting that a max-heap is utilized in our adapted BBS and its sorting key is the score of R-tree node/vertex w.r.t. a pivot vector of RR, whose value of each dimension is the mean of the polygon vertices of RR in that dimension [13]. The correctness of our adaptation is guaranteed as follows. (1) The pivot vector must lie in RR due to RR’s convexity [27]. (2) Vertices popped after vv cannot r-dominate vv due to pivot-based sorting (in decreasing order).

In addition, we adopt a directed acyclic graph (DAG) [28, 22] to maintain all pair-wise r-dominance relationships between vertices in HktH_{k}^{t}, named r-dominance graph (denoted by GdG_{d}). Fig. 4(b) illustrates GdG_{d} of H39H_{3}^{9}. An arc from vertex vv to vv^{\prime} signifies that vv r-dominates vv^{\prime}. It is noting that an arc from v6v_{6} or v2v_{2} to v7v_{7} is not needed as the transitivity of r-dominance relationship already implies this. The number of vertices that r-dominate vv is called vv’s r-dominance count.

Refer to caption
(a) H39H_{3}^{9} for Q={v2,v3,v6}Q=\{v_{2},v_{3},v_{6}\}
Refer to caption
(b) DAG GdG_{d}
Figure 4: The maximal (k,t)(k,t)-core and r-dominance graph

V Global Search

In this section, we develop a global search algorithm for Problem 2 and discuss its generalization for Problem 1. Before proceeding further, three useful lemmas are given as follows.

Lemma 4

The maximal (k,t)(k,t)-core, i.e., HktH_{k}^{t}, is an MAC.

Lemma 5

For any MAC, if we delete the smallest-score vertex w.r.t. any weight ww in RR and the resulting subgraph still has a kk-core^\widehat{core} HH containing QQ, HH is an MAC.

Lemma 6

For any MAC HH, if we delete the smallest-score vertex w.r.t. any weight ww in RR but the resulting subgraph does not have a kk-core^\widehat{core} containing QQ, HH is a non-contained MAC.

In view of the above lemmas, we can devise an efficient algorithm based on depth-first search (DFS) for our problems.

V-A The DFS-based Algorithm

The idea of the DFS-based algorithm is described in detail in Algorithm 1. First, for given kk and tt, we compute the maximal (k,t)(k,t)-core, i.e., HktH_{k}^{t}, and build the r-dominance graph GdG_{d}. Then, the following procedure is iteratively invoked until the resulting graph in each partition of RR does not have a kk-core^\widehat{core} containing QQ. The procedure consists of two steps. Let HH and GdG_{d}^{\prime} be the resulting subgraph and corresponding r-dominance graph in partition ρ\rho of RR (Line 6). The first step is to insert sub-partitions into ρ\rho according to GdG_{d}^{\prime} (Lines 7-9), then find the smallest-score vertex in each sub-partition (Lines 10-11). In Line 9, note that ρ\rho is the root node of a binary tree after being passed in as a parameter, and SS is a set of leaf nodes of the binary tree, representing sub-partitions of ρ\rho. This step is essential and will be elaborated in Section V-B. The second step is to delete all the vertices that are definitely excluded in subsequent MACs, which enables HH and GdG_{d}^{\prime} to be updated accordingly (Lines 15-20).

[Uncaptioned image]

In particular, Algorithm 1 recursively deletes all the vertices violating the structural cohesiveness constraint by a DFS procedure (Lines 16-19) as long as the smallest-score vertex uu is found in the sub-partition. Because the degrees of the adjacent vertices of uu all reduce by 1 when we delete uu. This may cause some neighbors of uu to violate the structural cohesiveness constraint and thus cannot be contained in subsequent MACs. Likewise, we also need to verify whether the neighbors of other hops (e.g., 2-hop, 3-hop, etc.) meet the structural cohesiveness constraint. Obviously, the DFS procedure can be used to identify and delete all these vertices.

According to Lemma 6, we have a corollary as follows.

Corollary 1

Given an MAC HH, HH is a non-contained MAC if the smallest-score vertex uu meets one of the following conditions: (1) uQu\in Q; and (2) deleting uu will recursively disconnect QQ (e.g., qQ\exists q\in Q being deleted) or make the degree of remaining vertices less than kk.

Note that we always consider the early termination conditions of Corollary 1 in conjunction with the DFS procedure. Once either is met, it means that vertex uu cannot be deleted, even if uu is currently the one with the smallest score but HH is already the non-contained MAC w.r.t. partition ρ\rho^{\prime} (Line 12). As a result, if Corollary 1 (i.e., Lemma 6) holds, the top-jj MACs can be obtained by the union of top vertices in heap II^{\prime} (totally backtracking j1j\!-\!1 times) and the last subgraph HH (Line 13). Based on Lemma 4 and 5, we can easily verify that HH for each partition ρ\rho recursively obtained in Line 6 is an MAC. Thus, Theorem 1 shows the correctness of Algorithm 1.

Theorem 1

Algorithm 1 correctly finds the top-jj MACs.

Proof Sketch: For any partition ρ\rho, as long as its current subgraph HH does not hold Corollary 1, ρ\rho will be divided into |S||S| sub-partitions by |HP||HP| hyperplanes (Line 10 in Algorithm 1), e.g., ρi\rho_{i} for 1i|S|1\leq i\leq|S|. Here, ρ\rho is discarded when the recursion proceeds to the promising sub-partitions of the local arrangement, i.e., ρi\rho_{i}. Assume that uu is the smallest-score vertex in HH w.r.t. ρi\rho_{i}, the resulting subgraph, denoted by HiH_{i}, obtained by invoking DFS procedure (Line 12 in Algorithm 1) can be claimed as the maximal kk-core^\widehat{core} of the subgraph H\uH\backslash u by contradiction, because DFS procedure recursively deletes all the vertices in HH whose degrees are smaller than kk. Therefore, we have VHiVHV_{H_{i}}\subset V_{H} and S(H)S(Hi)S(H)\leq S(H_{i}) w.r.t. ρi\rho_{i} for 1i|S|1\leq i\leq|S|. In this way, the non-contained MAC corresponding to each final sub-partition can be found, as well as all the MACs. Thus, we conclude that Algorithm 1 correctly finds the top-jj MACs. \square

We analyze the complexity of Algorithm 1 in Theorem 2.

Theorem 2

The time complexity and space complexity of Algorithm 1 are bounded by O(n2d)O(n^{\prime 2d}) and O(m+n+n2d)O(m^{\prime}+n^{\prime}+n^{\prime 2}\!\cdot\!d) respectively, where nn^{\prime} and mm^{\prime} denote the number of vertices and edges in HktH_{k}^{t}.

Proof:

The key factor determining Algorithm 1’s time complexity is the construction of arrangements. In the worst case, vertices in HktH_{k}^{t} are r-incomparable to each other, i.e., the complete arrangement of n(n1)2\frac{n^{\prime}(n^{\prime}-1)}{2} half-spaces needs to be constructed, in O(n2d)O(n^{\prime 2d}) time [29]. The algorithm only needs to store HktH_{k}^{t}, and maintains the heap II^{\prime} and half-space information related to dd, which uses less than O(m+n+n2d)O(m^{\prime}+n^{\prime}+n^{\prime 2}\!\cdot\!d) space complexity even in the worst case. ∎

Refer to caption
(a) Partitioning RR
Refer to caption
(b) Partitioning ρ1\rho_{1} and ρ2\rho_{2}
Figure 5: Arrangement and partitions in RR

V-B Arrangement Jointing and Indexing

In Algorithm 1, to find the smallest-score vertex for any weight vector in partition ρ\rho, we consider the vertices of GdG_{d}^{\prime} in a bottom-up manner. In other words, leaf vertices of the r-dominance graph will be preferred (Line 7). The reason is that if a vertex is deleted either because it does not satisfy the structural cohesiveness constraint (already considered in the DFS procedure), or because it is the smallest-score vertex, but before this, all vertices it r-dominates should be deleted first.

The verification of a leaf vertex uu in GdG_{d}^{\prime} entails partitioning ρ\rho by half-spaces HSi:S(u)S(u)HS_{i}:S(u^{\prime})\!\geq\!S(u), each corresponding to one of the remaining leaf vertex uu^{\prime}. Formally, an arrangement bounded by RR is defined by the supporting hyperplanes of these half-spaces, where each cell (i.e., sub-partition) is located in a set of half-spaces. The leaf vertices corresponding to these half-spaces are precisely those with scores higher than uu if ww falls in that cell, which means uu is the smallest-score vertex.

Consider GdG_{d} in Fig. 4(b). Initially, the leaf vertices are v7v_{7}, v5v_{5} and v1v_{1} (Gd=Gd,I=G_{d}^{\prime}\!=\!G_{d},I^{\prime}\!=\!\emptyset, Line 6 in Algorithm 1). Their respective half-spaces HS1:S(v7)S(v5)HS_{1}:S(v_{7})\!\geq\!S(v_{5}), HS2:S(v7)S(v1)HS_{2}:S(v_{7})\!\geq\!S(v_{1}) and HS3:S(v1)S(v5)HS_{3}:S(v_{1})\!\geq\!S(v_{5}) are inserted into the arrangement, as shown in Fig. 5(a). The vertex in brackets for each partition indicates the smallest-score vertex for any weight vector ww in that partition. For partitions ρ3\rho_{3} and ρ4\rho_{4} on the right, vertex v1v_{1} will also be deleted by the DFS procedure due to the deletion of v7v_{7}, after which v1v_{1} and v7v_{7} are both pushed into the heap II^{\prime} w.r.t. the partitions (representing the vertices ignored). Thus, the resulting subgraph HH induced by {v2,,v6}\{v_{2},\ldots,v_{6}\} is the corresponding non-contained MAC since discarding any vertex will no longer satisfy the structural cohesiveness constraint. By backtracking the top vertices in II^{\prime} once (i.e., v1v_{1} and v7v_{7}), we can easily obtain the second-ranked MAC induced by {v1,,v7}\{v_{1},\ldots,v_{7}\} in ρ3\rho_{3} and ρ4\rho_{4} (refer to R3R_{3} in Fig. 2(b)).

[Uncaptioned image]

When Corollary 1 does not hold, sub-partition ρ\rho^{\prime} will be pushed into queue UU to compute the non-contained MAC in depth. At this time, vertices ignored will be discarded to update resulting subgraph HH and corresponding GdG_{d}^{\prime}, so that new leaf vertex can be designated in the next round of verification and the half-spaces against other leaf vertices are inserted into the local arrangement. Consider ρ1\rho_{1} in Fig. 5(a), we refer to v4v_{4} as the new leaf vertex after v1v_{1} is deleted, i.e., HH and GdG_{d}^{\prime} are induced by {v2,,v7}\{v_{2},\ldots,v_{7}\} and I={v1}I^{\prime}=\{v_{1}\}. Then, new half-spaces HS4:v7v4HS_{4}:v_{7}\!\geq\!v_{4} and HS5:v5v4HS_{5}:v_{5}\!\geq\!v_{4} are inserted into a newly initialized local arrangement against partition ρ1\rho_{1}, where three sub-partitions ρ5\rho_{5}, ρ6\rho_{6} and ρ7\rho_{7} are produced, as shown in Fig. 5(b). As v4v_{4} and v5v_{5} are pushed into II^{\prime} together, the non-contained MAC induced by {v2,v3,v6,v7}\{v_{2},v_{3},v_{6},v_{7}\} is returned for each sub-partition. Likewise, same operation applies to partition ρ2\rho_{2}. Eventually, the solution in Example 3 is obtained. Note that we can directly locate HS4HS_{4} and HS5HS_{5} for ρ2\rho_{2} since no new half-space needs to be computed due to the same leaf vertices (in GdG_{d}^{\prime}) as ρ1\rho_{1}. This drastically reduces repetitive computation in half-spaces, each of which is computed only once if necessary.

Specifically, for each local arrangement considered, an index is built by a recursive process Partition (Algorithm 2). Then, the index is discarded when all relevant half-spaces are inserted, leaving only the hopeful sub-partitions of the local arrangement (if any). Note that, for any index of ρ\rho (Line 9 in Algorithm 1), the total cost of inserting the ii-th hyperplane hphp is O(id1)O(i^{d-1}) [20]. In addition, optimization of arrangement indexing and maintenance is the same as described in [13].

VI Local Search

Although the efficiency of the global search algorithm is considerable for each query, it may need to explore the entire maximal (k,t)(k,t)-core, especially when query vertices QQ are located at the upper layer of the r-dominance graph GdG_{d}. In this section, we devise the local search algorithms for Problem 2 and investigate the generalization for Problem 1.

[Uncaptioned image]

The intuition is that the non-contained MACs for QQ are in the vicinity of QQ. Thus, the entire HktH_{k}^{t} should be unnecessary to involve during the search. Nonetheless, it is intractable to enumerate all the kk-core^\widehat{core}s containing QQ, whose number can be exponentially large w.r.t. HktH_{k}^{t} size. Accordingly, we only immerse in finding the communities that are most likely to be candidates for non-contained MACs. Their validity and corresponding partitions in RR can be quickly verified by GdG_{d} alone. This inspires us to develop a framework of more efficient local search (Algorithm 3). Specifically, Expand procedure finds candidates (i.e., CC) by selecting the most promising vertex as we explore in the neighborhood of QQ, and stops when each target community forms a kk-core^\widehat{core}. Verify procedure provides guarantee of identifying all valid non-contained MACs w.r.t. RR from CC. Note that the time complexity of Algorithm 3 is bounded by O(|C|sd))O(|C|\!\cdot\!s^{d})) (see Theorem 3 and 4), which is much lower than that of Algorithm 1 (|C||C| and ss are typically very small in practice). As verified in our experiments, local search is at least one order of magnitude faster than global search, and all non-contained MACs will be expanded by our candidate generation strategies in most cases.

In this way, two problems arise: (1) how do we guarantee that the target community can be a candidate for non-contained MACs (at least form a kk-core^\widehat{core} containing QQ); and (2) how do we know whether the kk-core^\widehat{core} is a valid non-contained MAC w.r.t. RR? The former poses a great challenge of determining which vertex to choose and when to terminate expansion, and the latter requires an in-depth study of the characteristics of non-contained MACs. In the following, we present lemmas and algorithms for the local search strategy.

VI-A Candidate Generation

To be a candidate for non-contained MACs, the current community must be at least a kk-core^\widehat{core} containing QQ. It would be nice if the structural cohesiveness metric (i.e., kk-core) is “monotonic”, which means that the larger the community, the smaller its minimum degree. So once the minimum degree drops below the given coreness threshold kk, we can stop the search. Unfortunately, such a metric is not monotonic. Thus, the greatest challenge is to overcome non-monotonicity first, which motivates us to conduct community search only by exploring local neighborhood of QQ.

Now for the minimum degree of a subgraph HH, denoted by δ(H)\delta(H), we make an in-depth analysis of its monotonicity. Consider the exploration starting from the query vertices, i.e., H0H_{0} induced by QQ. We add a vertex from HktH_{k}^{t} at each step until a kk-core^\widehat{core} HH is obtained, assuming that v1,v2,,vev_{1},v_{2},\ldots,v_{e} is a vertex sequence it adds. So let HiH_{i} be induced by Q{v1,,vi}Q\cup\{v_{1},\ldots,v_{i}\}. In general, δ(H)\delta(H) is a non-monotonic function of HH. More formally, δ(Hi+1)\delta(H_{i+1}) is unnecessarily greater than δ(Hi)\delta(H_{i}). However, the order of vertices added to VHV_{H} determines the monotonicity of δ(H)\delta(H). Interestingly, for any QQ with δ(H0)=0\delta(H_{0})=0, we can always find a sequence of added vertices such that δ(Hi)\delta(H_{i}) is a non-decreasing function of ii.

Lemma 7

For any query vertices QVHQ\subseteq V_{H} with δ(H0)=0\delta(H_{0})=0 in graph HktH_{k}^{t}, there always exists an added vertex sequence v1,v2,,vev_{1},v_{2},\ldots,v_{e} of HH starting with QQ such that 0ie,δ(Hi)δ(Hi+1)\forall 0\leq i\leq e,\delta(H_{i})\leq\delta(H_{i+1}).

Proof Sketch: It is consistent with proving that vertices can be removed one by one from VHV_{H} until QQ, just ensuring that each removal of vv does not increase the minimal degree of the remaining vertices. Otherwise, it occurs only when vv is currently one of the vertices with the minimal degree. The reason is that removing a vertex with non-minimal degree will only preserve or decrease the minimal degree. \square

[Uncaptioned image]

Lemma 7 implies that there always exists an exploration order that leads monotonically to a community of kk-core^\widehat{core} containing QQ. This can be generated by a sequence of vertices added to VHV_{H} starting from QQ so that δ(Hi)δ(Hi+1)\delta(H_{i})\leq\delta(H_{i+1}) for each ii. Note that the existence of such an order is a necessary but insufficient condition for finding a valid community. To illustrate the insufficiency, consider Q={v9}Q=\{v_{9}\} and k=2k=2 in Fig. 1(a). Any vertex sequence starting with v9,v14v_{9},v_{14} cannot yield a valid solution, yet δ(H1)\delta(H_{1}) is greater than δ(H0)\delta(H_{0}).

In Expand procedure (Algorithm 4), we explore from the vicinity of QQ by BFS and generate candidate set CC. To converge the current community towards a candidate for non-contained MAC, we develop two intelligent candidate selection strategies according to Lemma 36 and 7. The idea of improving candidate generation is to use priority queues such that the most promising vertex can be selected to rapidly generate a candidate. From the perspective of structural cohesiveness, the priority of a vertex vv can be defined as f1(v)=δ(H)δ(H)f_{1}(v)=\delta(H^{\prime})-\delta(H) or f2(v)=dgH(v)f_{2}(v)=dg_{H^{\prime}}(v), where VH=VH{v}V_{H}^{\prime}=V_{H}\cup\{v\}. f1(v)f_{1}(v) emphasizes the improvement of minimum degree for the next step, with f1(v)=1f_{1}(v)=1 or 0 for any vv; f2(v)f_{2}(v) produces the fastest increase in average degree of HH so that the minimum degree will increase with HH’s density growth. From the perspective of community score, the priority of vv can be defined as f3(v)=ζl(v)f_{3}(v)=\zeta-l(v), where ζ\zeta is a constant (maximum priority in GdG_{d}) and l(v)l(v) denotes the layer of vv in GdG_{d}. f3(v)f_{3}(v) drives community score higher by adding a vertex that r-dominates as many vertices as possible. To sum up, the priority f(v)f(v) is defined as

f(v)=λf2(v)+f3(v),f(v)=\lambda\!\cdot\!f_{2}(v)+f_{3}(v), (3)

where λ\lambda is a trade-off against ζ\zeta, or

f(v)=ζf1(v)+f3(v).f(v)=\zeta\cdot f_{1}(v)+f_{3}(v). (4)
Theorem 3

The time complexity of Algorithm 4 by Eq. 3 and Eq. 4 is O(n¯+m¯)O(\overline{n}+\overline{m}) and O(n¯+m¯logn¯)O(\overline{n}+\overline{m}log\overline{n}) respectively, where n¯\overline{n} and m¯\overline{m} denote the number of vertices and edges in CC. The space complexity is O(n¯+m¯)O(\overline{n}+\overline{m}).

VI-B Verification

The determinant of global search is that computing the local arrangement of all half-spaces HSiHS_{i} among leaf vertices is a relatively expensive process (in O(id)O({i^{*}}^{d}) time [29], where ii^{*} is the number of half-spaces). Instead, in Verify procedure (Algorithm 5), an empty arrangement in RR is initialized, into which half-spaces w.r.t. a carefully selected and therefore very small subset of vertices (i.e., competitors below) are inserted, expecting to securely confirm or disqualify candidate HH without considering all other vertices. But before this, we first give a corollary to filter out unpromising candidates from CC. Note that GeG_{e} represents the r-dominance graph corresponding to HH, which is a subgraph of GdG_{d} induced by VHV_{H}, denoted by Gd[VH]G_{d}[V_{H}]; and GcG_{c} represents the rest of GdG_{d}, denoted by Gd[VGd\VH]G_{d}[V_{G_{d}}\!\backslash\!V_{H}].

Corollary 2

A community HH can be discarded if one of the following conditions is met: (1) vVGd\VH\forall v\!\in\!V_{G_{d}}\!\backslash\!V_{H}, vv is a non-leaf vertex in GdG_{d}; and (2) vVGd\VH,vVH\exists v\!\in\!V_{G_{d}}\!\backslash\!V_{H},v^{\prime}\!\in\!V_{H} and vvv\!\succ\!v^{\prime}, vv cannot be recursively deleted by deleting vertices in VGd\VHV_{G_{d}}\!\backslash\!V_{H}.

Proof Sketch: Vertices in VGd\VHV_{G_{d}}\!\backslash\!V_{H} must be deleted if HH holds. \square

In other words, either there exists a non-cross-layer arc in GcG_{c} between a non-leaf vertex vv and a leaf vertex, or vv can be recursively deleted by the DFS procedure. We refer to the HH that holds Corollary 2 as a promising community.

Lemma 8

For any promising HH, vv is regarded as an anchor if HH still forms a kk-core^\widehat{core} after a (non-QQ) leaf vertex vv in GeG_{e} is deleted.

Now we discuss the verification process by half-space insertion, in which it may further benefit from the r-dominance relationships stored in GdG_{d} as follows.

Lemma 9

Consider uu, uu^{\prime}, and their half-space HSiHS_{i} inserted into the arrangement. Assume that u′′u^{\prime\prime} is a vertex r-dominated by uu^{\prime}, and ρ\rho is a partition in the arrangement not covered by HSiHS_{i}. Thus uu is guaranteed to r-dominate u′′u^{\prime\prime} in partition ρ\rho.

Proof:

First, from the definition of half-space HSiHS_{i}, S(u)<S(u)S(u^{\prime})\!<\!S(u) holds anywhere outside HSiHS_{i}. Second, S(u′′)S(u)S(u^{\prime\prime})\!\leq\!S(u^{\prime}) holds anywhere inside RR by the definition of r-dominance. Thus, we derive that for each partition ρ\rho of the arrangement outside HSiHS_{i}, S(u′′)<S(u)S(u^{\prime\prime})\!<\!S(u), that is, uu r-dominates u′′u^{\prime\prime} in ρ\rho. ∎

[Uncaptioned image]

Based on Lemma 9, competitors chosen are the vertices in the bottom layer (leaf vertices) of GeG_{e} and in the top layer (with r-dominance count 0) of GcG_{c}, denoted by lb(Ge)l_{b}(G_{e}) and lt(Gc)l_{t}(G_{c}) respectively. The fundamental is that intuitively such competitors are the strongest, which are most likely to assist in disqualifying invalid candidates w.r.t. RR in turn.

Corollary 3

For any promising HH, it is a valid non-contained MAC if a partition exists in RR such that all vertices in lb(Ge)l_{b}(G_{e}) score higher than those in lt(Gc)l_{t}(G_{c}), and the corresponding conditions are met:

  • 1.

    If Lemma 8 holds, additionally, all anchors need to score higher than other leaf vertices in GeG_{e}.

  • 2.

    vlt(Gc)\exists v\!\in\!l_{t}(G_{c}), if vv can be recursively deleted by the DFS procedure starting from lb(Gc)l_{b}(G_{c}) (namely, vv is bound), then lt(Gc)l_{t}(G_{c}) is updated where vv is ignored and replaced by the vertices of its next layer in GcG_{c}.

  • 3.

    v,vlt(Gc)\exists v,v^{\prime}\!\in\!l_{t}(G_{c}), if vv and vv^{\prime} are bound to each other, then vertices in lb(Ge)l_{b}(G_{e}) only need to score higher than vv or vv^{\prime}.

Proof:

According to Corollary 2, all vertices in VGd\VHV_{G_{d}}\!\backslash\!V_{H} have to be deleted when HH is a promising community. In other words, vertices in VGd\VHV_{G_{d}}\!\backslash\!V_{H} are either recursively deleted by deleting those in the lower layer of GcG_{c}, or deleted individually due to their lower scores. First, we consider the latter case. As a sufficient and necessary condition, these vertices just need to score lower than those in lb(Ge)l_{b}(G_{e}); that is, only vertices in lb(Ge)l_{b}(G_{e}) score higher than those in lt(Gc)l_{t}(G_{c}). Then, we consider the former case. That is, if condition (2) holds, it means that the restriction on half-spaces between the competitors can be relaxed by the newly updated vertices in lt(Gc)l_{t}(G_{c}), since such a vertex vv will be deleted anyway. Similarly, if condition (3) holds, the restriction on half-spaces between the competitors can also be relaxed through such vertices, because deletion of one will also lead to deletion of the other. On this basis, HH becomes a non-contained MAC when Lemma 8 does not hold; otherwise, condition (1) has to be satisfied. This is because such an anchor can still be deleted as it is a non-QQ leaf vertex in GeG_{e}, i.e., possibly the current smallest-score vertex in HH. Putting it all together, we ensure the correctness of the partition corresponding to the non-contained MAC HH (if any in RR). ∎

To illustrate Algorithm 5, let us reconsider Example 2. Assume that by Algorithm 4 we have three promising communities H1H_{1}, H3H_{3} and H4H_{4}, where VH1={v2,v3,v6,v7}V_{H_{1}}\!=\!\{v_{2},v_{3},v_{6},v_{7}\}, VH3={v2,,v6}V_{H_{3}}\!=\!\{v_{2},\ldots,v_{6}\} and VH4={v1,v2,v3,v6,v7}V_{H_{4}}\!=\!\{v_{1},v_{2},v_{3},v_{6},v_{7}\}. For H1H_{1}, lb(Ge)={v7}l_{b}(G_{e})\!=\!\{v_{7}\} and lt(Gc)={v4,v5}l_{t}(G_{c})\!=\!\{v_{4},v_{5}\}. As v4v_{4} and v5v_{5} are bound to each other in H39H_{3}^{9} (condition (3) met), we only insert HS1HS_{1} and HS4HS_{4} into RR in Fig. 5(b), and choose the partitions covered by either of them. As a result, H1H_{1} is a valid non-contained MAC for any weight vector of R1R_{1} in Fig. 2(b). Similarly, H3H_{3} is also valid w.r.t. R2R3R_{2}\cup R_{3} (condition (2) met) but H4H_{4} is invalid (condition (1) met) as its partition is outside RR.

Theorem 4

The time complexity of Algorithm 5 is O(|C|(n+m)+csd)O(|C|\!\cdot\!(n^{\prime}\!+\!m^{\prime})+c\!\cdot\!s^{d}), where |C||C| and cc denote the number of candidates and the number of promising communities in CC respectively, and ss is the product of |lb(Ge)||l_{b}(G_{e})| and |lt(Gc)||l_{t}(G_{c})|. The space complexity is bounded by O(csd+n¯+m¯+n+m)O(c\!\cdot\!s\!\cdot\!d+\overline{n}+\overline{m}+n^{\prime}+m^{\prime}).

Finally, we can simply generalize local search for Problem 1. As the non-contained MAC HH with corresponding partition ρ\rho is known, we insert sub-partitions into ρ\rho according to GcG_{c} (in a up-bottom manner) and add the highest-score vertex to HH. As long as the current HH contains an MAC, it will be output. The process terminates until all top-jj MACs in ρ\rho are acquired. Consider H1H_{1} and its GcG_{c}, half-spaces among vertices in lt(Gc)l_{t}(G_{c}) (i.e., HS5HS_{5}) are inserted first into R1R_{1}. Then v5v_{5} is added to VH1V_{H_{1}} for ρ5\rho_{5} and ρ8\rho_{8} shown in Fig. 5(b). As v4v_{4} r-dominates v1v_{1}, we have the second MAC induced by {v2,,v7}\{v_{2},\ldots,v_{7}\} in ρ5\rho_{5} and ρ8\rho_{8}. The same applies to ρ6\rho_{6} and ρ7\rho_{7}.

VII Experiments

TABLE II: Datasets (K=10310^{3} and M=10610^{6})
Dataset Vertices Edges dgavgdg_{avg} dgmaxdg_{max} kmaxk_{max}
San Francisco (SF) 175K 223K 2.55 8 -
Florida (FL) 1.1M 1.4M 2.53 12 -
Slashdot 79K 0.5M 13 2,507 85
Delicious 536K 1.4M 5 3,216 34
Lastfm 1.2M 4.5M 7 5,150 71
Flixster 2.5M 7.9M 6 1,474 69
Yelp 3.6M 9.0M 5 10,433 129

Comprehensive experiments are conducted to evaluate the proposed model and four algorithms, named GS-T, GS-NC, LS-T and LS-NC respectively. GS-T and GS-NC (resp. LS-T and LS-NC) are global search algorithms (resp. local search algorithms) used to compute the top-jj MACs and the non-contained MACs, respectively. Note that either of the two candidate selection strategies in Section VI-A can be adopted in LS-T or LS-NC, and we just give the results by using Eq. 3 with ζ=100\zeta\!=\!100 and λ=10\lambda\!=\!10 (results by using Eq. 4 are similar and omitted to save space). All algorithms were implemented in C++, and all experiments were conducted on an Ubuntu server with 2GHz Intel Xeon E7-4820 CPU and 1TB memory.

Datasets. We use five real-world social networks222http://networkrepository.com,333https://www.yelp.com/dataset and two road networks (SF444https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm/FL555http://www.dis.uniroma1.it/challenge9/index.shtml) in our experiments. Table II summarizes the statistics of datasets, of which dgavgdg_{avg}, dgmaxdg_{max} and kmaxk_{max} denote the average degree, the maximal degree and the maximal core number, respectively. Note that numerical attributes are not contained in the first four original social networks2, for which we employ a widely used method in [21] to generate three different types of numerical attributes, i.e., independence, correlation and anti-correlation. Due to space limit, we report the results obtained from datasets with independent and real attributes. In addition, we map each user vv to a spatial point pp in the road network that matches the scale of his/her social network as follows: we project SF/FL into range [0,1][0,1] in each dimension and generate normalized L(v)L(v) by drawing from a list of recent check-ins. Assume that pp is the current location of vv if it has the smallest Euclidean distance to L(v)L(v) in the projection space.

TABLE III: Parameters
Parameter Tested values
kk 4, 8, 16, 32, 64
tt (SF/FL) 600/800, 800/1000, 1000/1200, 1200/1400, 1400/1600
dd 2, 3, 4, 5, 6
|Q||Q| 1, 4, 8, 16, 32
jj 5, 10, 20, 40, 60
σ\sigma 0.1%, 0.5%, 1%, 5%, 10%

Parameters. We vary 66 parameters: structural cohesiveness kk, query distance tt, dimensionality dd, number of query users |Q||Q|, number jj of top MACs, and percentage σ\sigma of axis length (i.e., side-length of RR). Table III shows the range of parameters and their default values (in bold). For each value of |Q||Q|, we randomly select 100100 sets of query vertices, that satisfy tt and can ensure the existence of the maximal (k,t)(k,t)-core, from the kk-core of each social network. In each experiment, only one parameter varies and the rest remains at the default. Every reported measurement is the average of 1,0001,000 MAC searches for ten randomly generated axis-parallel hypercubes RR in the preference domain.

VII-A Performance Evaluation

Refer to caption
(a) Varying kk
Refer to caption
(b) Varying tt
Refer to caption
(c) Varying dd
Refer to caption
(d) Varying |Q||Q|
Refer to caption
(e) Varying jj
Refer to caption
(f) Varying σ\sigma
Figure 6: Efficiency and scalability of proposed algorithms in SF+Slashdot with independent attributes.
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying tt
Refer to caption
(c) Varying dd
Refer to caption
(d) Varying |Q||Q|
Refer to caption
(e) Varying jj
Refer to caption
(f) Varying σ\sigma
Figure 7: Efficiency and scalability of proposed algorithms in SF+Delicious with independent attributes.
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying tt
Refer to caption
(c) Varying dd
Refer to caption
(d) Varying |Q||Q|
Refer to caption
(e) Varying jj
Refer to caption
(f) Varying σ\sigma
Figure 8: Efficiency and scalability of proposed algorithms in FL+Lastfm with independent attributes.
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying tt
Refer to caption
(c) Varying dd
Refer to caption
(d) Varying |Q||Q|
Refer to caption
(e) Varying jj
Refer to caption
(f) Varying σ\sigma
Figure 9: Efficiency and scalability of proposed algorithms in FL+Flixster with independent attributes.
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying tt
Refer to caption
(c) Varying dd
Refer to caption
(d) Varying |Q||Q|
Refer to caption
(e) Varying jj
Refer to caption
(f) Varying σ\sigma
Figure 10: Efficiency and scalability of proposed algorithms in FL+Yelp with independent attributes.

Exp-1: Varying kk. We evaluate the query processing time of all algorithms and the number of vertices of HktH_{k}^{t} by varying kk. In each road-social network, we can see that local search performs better than global search, but the advantage becomes less obvious when kk increases. In the best case, e.g., k=4k\!=\!4, LS-T and LS-NC are more than one order of magnitude faster than GS-T and GS-NC in Fig. 7(a) and 9(a). Only when k=64k\!=\!64, global search is comparable to local search since HktH_{k}^{t} size shrinks when kk is large (see Fig. 11(c)), resulting in a reduction in the time complexity of global search and in the number of promising vertices involved in local search. Although Delicious is larger than Slashdot, but algorithms run faster at k=16k\!=\!16 and k=32k\!=\!32 since HktH_{k}^{t} contains fewer vertices (as kmax=34k_{max}\!=\!34 in Table II). Note that when kk increases from 44 to 88, local search is merely more than twice as fast, e.g., LS-NC takes 4141s and 1717s respectively in Fig. 7(a). This is because when kk is relatively small, candidate selection strategies tend to find more smaller candidates. However, algorithms run faster in Yelp than in Flixster, yet its HktH_{k}^{t} size is larger. This will be explained in Exp-6. In short, across a wide range of kk local search is consistently better than global search.

Exp-2: Varying tt. We evaluate the query processing time of all algorithms by varying tt. For each road-social network, local search outperforms global search significantly, with advantage becoming more obvious as tt increases. For example, in Fig. 9(b), GS-NC takes 1,9921,992s while LS-NC takes 245245s for t=1600t\!=\!1600. Note that the results are obtained in the case of k=16k\!=\!16, thus generally LS-T and LS-NC are almost one order of magnitude faster than GS-T and GS-NC in terms of tt. Because when tt is large, more users are retained via range query accelerated by G-tree [24] or G*-tree [25]. This favors local search radiating outward from QQ, equivalent to increasing the expansion radius.

Exp-3: Varying dd. By varying dd we study the query processing time and the memory overhead of all algorithms, and comparison of different methods. The toughness of MAC search rises with dd due to its computational geometric nature. Nonetheless, all four algorithms offer practical processing time, taking respectively 159159s and 4545s for d=6d=6 in Fig. 9(c). Furthermore, Fig. 11(d) shows the memory overhead of GS-NC/LS-NC and BBS process (XX indexed and GdG_{d} built). When dd increases, dimension of R-tree increases but memory overhead changes not much due to unchanged GdG_{d} size; local search is very lightweight against global search. For example, GS-NC takes 664664MB and 2,2192,219MB while LS-NC takes only 5252MB and 152152MB for d=3d\!=\!3 and d=6d\!=\!6, respectively. The results confirm both the theoretical analysis and the claims on arrangement indexing in Section V-B. In addition, Fig. 13 and 14 show the comparison with methods in [4] and [8], where Influ (resp. Influ+) is the DFS-based (resp. ICP-index based) algorithm, and Sky (resp. Sky+) is the basic (resp. space-partition) algorithm. We implement Influ and Influ+ by varying kk instead of dd since they can only capture 11-dimensional attribute. For a fair comparison, 100100 weight vectors that fall anywhere in RR are randomly selected to respectively calculate the weighted sum of dd (at the default) numerical attributes as vertex influence (i.e., score), and the average processing time is reported in Fig. 13(a) and 14(a). Since no r-dominance graph needs to be maintained and no half-space has to be computed nor inserted, Influ and Influ+ are superior to GS-NC and LS-NC in terms of processing time, while Sky and Sky+ are generally the most expensive due to their time complexity. On the other hand, in terms of dd, Sky and Sky+ are much costlier than ours and intractable when dd is relatively large, e.g., d3d\!\geq\!3 and d5d\!\geq\!5 respectively in Fig. 14(b). Here, “Inf” means processing time exceeds 10,00010,000s. Therefore, our model and algorithms are tractable and scalable to handle real-world applications comprehensively and flexibly.

Exp-4: Varying |Q||Q|. We evaluate the query processing time of all algorithms and the ratio of non-contained MACs (NC-MACs) found by LS-NC to GS-NC by varying |Q||Q|. All processing time almost monotonically decreases with the growth of |Q||Q| since it accelerates the convergence of both global and local search. The reason is that increasing |Q||Q| means more vertices cannot be deleted in global search and selecting fewer vertices may find a candidate in local search. Note that in Fig. 7(d), as k=16k\!=\!16, algorithms are not much faster at Q=32Q\!=\!32 than Q=16Q\!=\!16. In fact, we also find that global search terminates early if the generated query vertices QQ are located in the lower layer of GdG_{d} because it is more likely to encounter QQ when deleting vertices, while local search is just the opposite. Fig. 12(a) and 12(b) show the ratio against kk and |Q||Q|, respectively. We can see that both the ratios decrease with the growth of kk and QQ, but it is still satisfactory. The reason is that the number of community candidates expanded by the Expand procedure tends to decrease with the increase of kk or |Q||Q|, but the Verify procedure can always ensure the correctness of the corresponding partitions in RR of any (promising) community. On the other hand, this proves that local search is very useful for applications expecting only part of MACs. Note that, in practice, |Q||Q| is usually not very large and the ratio reaches 95%95\% at default |Q||Q|, which confirms that local search can find all non-contained MACs in most cases.

Refer to caption
(a) No. of partitions (during search)
Refer to caption
(b) No. of non-contained MACs
Refer to caption
(c) #Vertices of HktH_{k}^{t}
Refer to caption
(d) Memory overhead (FL+Lastfm)
Figure 11: Scalability of proposed algorithms.
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying |Q||Q|
Figure 12: Ratio of NC-MACs found by LS-NC to GS-NC in FL+Lastfm.
Refer to caption
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying dd
Figure 13: Comparison of different methods in SF+Delicious (independence).
Refer to caption
Refer to caption
(a) Varying kk
Refer to caption
(b) Varying dd
Figure 14: Comparison of different methods in FL+Flixster (independence).

Exp-5: Varying jj. We examine the query processing time of GS-T and LS-T by varying jj. The curve of GS-T is rising slowly with increasing jj since the top-jj MACs can be directly obtained after executing global search. But for LS-T, after obtaining the non-contained MAC, its corresponding cell has to be divided again to find the top-jj MACs, resulting in an increase in processing time.

Exp-6: Varying σ\sigma. We evaluate the query processing time of all algorithms, and the number of partitions and non-contained MACs by varying σ\sigma (determining the size of region RR). As anticipated, a larger RR means a larger output, thereby more computations needed. In Fig. 11(a), we can see that the growth of σ\sigma will lead to a significant increase in the number of partitions in RR, which also explains the increase in processing time of all algorithms. Fig. 11(b) records the relationship between the number of non-contained MACs obtained by GS-NC and σ\sigma. Similarly, as the number of partitions increases, the diversity of non-contained MACs w.r.t. RR will also increase. Note that both bars of Yelp in Fig. 11(a) and 11(b) are much shorter than those of Flixster while its HktH_{k}^{t} size is larger in Fig. 11(c). Since attributes not only in Yelp but in real world are usually correlated or more, fewer (even unique) branches in DAG result in less processing time, that is, less half-space computation and insertion.

VII-B Case Study

NA+Aminer. We apply the road network of North America4 (NA) and the Aminer (aminer.org) for the first case study. The Aminer is a scientific collaboration network that incorporates authors in DB, DM, IR, and ML fields, comprised of 109,931109,931 vertices and 300,000300,000 edges. For each author we crawl four numerical attributes: h-index, #publications, activeness, and diverseness. To evaluate the effectiveness of MAC model in real world, we map each author to the location in NA according to its affiliation and use Q={Q\!=\!\{“Jiawei Han”, “Jian Pei”, “Philip S. Yu”, “Xifeng Yan”}\}, who are renowned scientists in DM (i.e., relatively high scores), as query vertices. After setting k=5,j=2k\!=\!5,j\!=\!2 and R=[0.1,0.3]×[0.3,0.5]×[0.05,0.1]R\!=\![0.1,0.3]\!\times\![0.3,0.5]\!\times\![0.05,0.1] (with tt large enough), Fig. 15(a-d) show the top-22 MACs anywhere in RR. Furthermore, we compare MAC with different models. For InfC [4], Fig. 15(f, g) report the results involving QQ, respectively taking #publications only and weighted sum (by w=(0.3,0.4,0.1)w\!=\!(0.3,0.4,0.1)) as influence. In fact, InfC either cannot capture the characteristics of all attributes, or must be covered by a non-contained MAC (NC-MAC) if wRw\!\in\!R (e.g., Fig. 15(c)). For SkyC [8], there are two results, one is the same as NC-MAC in Fig. 15(a), while the other shown in Fig. 15(e) only contains partial QQ and is covered by NC-MAC in Fig. 15(c). In effect, we find that SkyC is always contained in NC-MACs due to no query vertices and its skyline property. For ATC [7], Fig. 15(h) reports the (6,2)(6,2)-truss w.r.t. QQ and keyword “DM” as a (k+1)(k\!+\!1)-truss is a kk-core. Although communication cost is low (i.e., 22), its size is still too large since it only considers maximum inclusion of keywords but ignores attributes. Therefore, the MAC model is very effective, comprehensive and flexible for applications.

Refer to caption
(a) top-11 NC-MAC
Refer to caption
(b) top-22 MAC
Refer to caption
(c) top-11 NC-MAC
Refer to caption
(d) top-22 MAC
Refer to caption
(e) SkyC
Refer to caption
(f) InfC (11-D)
Refer to caption
(g) InfC (wRw\!\in\!R)
Refer to caption
(h) ATC (“DM”)
Figure 15: Case study of Aminer+NA: results for k=5k=5.

SF+Yelp. We apply the SF and the Yelp3 for the second case study. In addition to profile information (e.g., ID, first name, etc.), users in Yelp also have real attribute data, such as average rating of all reviews, #reviews written, #hot compliments and so on. Note that the Yelp is more like an LBSN composed of many relatively small ego-like networks, in which ego-like users usually have more fans or followers, write more reviews and receive more compliments. That is, these ego-like users are highly active, showing high attribute values in each dimension; while ordinary users have very low attribute values in all dimensions (e.g., only browsing without posting). In fact, we find that in real world attributes are usually correlated or more, e.g., most users in Yelp have an attribute value of 0. Thus, the number of branches in DAG will be extremely small or even unique, resulting in less half-space computation/insertion and fewer partitions in RR. We map each user to the location in SF according to check-ins and use Q={Q\!=\!\{“Emi”, “Phil”, “Dani”, “Michelle”}\}, who are relatively active, as query vertices. To discover a group of people who are more concerned and popular, #hot compliments, #more compliments and #photo compliments are used as three numerical attributes here. By setting k=6,t=300,j=3k\!=\!6,t\!=\!300,j\!=\!3 and R=[0.4,0.5]×[0.1,0.2]R\!=\![0.4,0.5]\!\times\![0.1,0.2], Fig. 16 shows the top-33 MACs in RR. This fully illustrates that in the real world, the diversity of (non-contained) MACs and the number of corresponding partitions w.r.t. RR are very small and user-friendly.

Refer to caption
(a) top-11 NC-MAC
Refer to caption
(b) top-22 MAC
Refer to caption
(c) top-33 MAC
Figure 16: Case study of Yelp+SF: results for k=6k=6.

VIII Related Work

Community model and search. A large number of community models have been proposed such as kk-core [12, 2], kk-truss [3], maximal clique [30], quasi-clique [15], maximal kk-edge connected subgraph [31, 32, 33], locally densest subgraph [34], query-biased density [35], etc. All these models consider only graph structural information but ignore numerical/textual attributes associated with vertices. To discover cohesive subgraphs containing the query vertices, a community search problem (CSP) was studied to find the maximal connected kk-core in social networks [1], for which [2] proposed a more efficient local search algorithm. Recently, [7] introduced the CSP of small-diameter kk-truss with similar query attributes. [5] and [6] developed the CSP of kk-core with textual attributes and smallest minimum covering circle, respectively. [4] proposed an influential community with vertex’s influence as one numerical attribute, based on which [8] studied a skyline community for dd-dimensional numerical attributes. More recently, [36] studied the CSP of kk-truss with distance of at most dd for any two vertices. [37] and [38] investigated a cohesive version of CSP that brings all community members closest to the point-of-interest in road networks. [39] and [17] studied two different CSPs in terms of context with only query keywords but no query vertices. In addition, [40] and [41] made variations on the CSPs in [7] and [36], respectively.

This work differs from all the prior work in the following. (1) Our multi-attributed community (MAC) model is the first one that can incorporate uncertainty of user preferences in the weight vector into dd-dimensional numerical attributes and capture spatial cohesiveness between users in road networks. (2) The preference domain is introduced into community modeling for the first time. (3) We study the novel MAC search problems in road-social networks such that our techniques are significantly different from all previous CSP algorithms.

Skyline and its generalization. The r-dominance graph used in our MAC model is relevant to the skyline [21] and more to its extension, the kk-skyband [26]. In traditional top-jj queries, if two records are inconsistent and one has no smaller value in any dimension [21], then it dominates the other. Thus, for a dataset the skyline consists of the records which are not dominated by any other; while the kk-skyband comprises those dominated by fewer than kk other records [26], indicated as a superset of all records which for any weight vector may occur in the top-jj results. As a typical kk-skyband computation algorithm, BBS [26] adopts a spatial index in the dataset, following the branch-and-bound paradigm [42].

This proves an additional essential difference between [8] and our work from another perspective. Regardless of social and spatial cohesiveness, dominance in [8] between two communities comes down to a series of standard dominance tests on dd-dimensional vectors. However, r-dominance tests adopted in our model do not suffice, e.g., two or more non-skyline communities may still collaboratively disqualify a skyline one if they score higher than it at different parts of RR, collectively blocking it from being a top-jj result anywhere in RR.

IX Conclusions

In this paper, we propose a novel community model to discover normative communities suitable for multi-criteria decision making in a road-social network, in which each user is linked with location information and dd (1\geq\!1) numerical attributes. Taking a preference region of dd-dimensional data domain as input, the resulting communities identified by our model cannot be r-dominated by other ones as long as the weight vector could fall anywhere in the region. We formalize the multi-attributed community search; distinguish two problem versions; develop solutions for corresponding processing; and using both real-world and synthetic datasets demonstrate the efficiency and scalability of our solutions and the effectiveness of our model.

Acknowledgment

This work is supported by NSFC (No. 61932004, 61732003, 61729201 and 62072087) and Fundamental Research Funds for the Central Universities (No. N181605012). Ye Yuan is the corresponding author.

References

  • [1] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in SIGKDD, 2010, pp. 939–948.
  • [2] W. Cui, Y. Xiao, H. Wang, and W. Wang, “Local search of communities in large graphs,” in SIGMOD, 2014, pp. 991–1002.
  • [3] X. Huang, L. V. Lakshmanan, J. X. Yu, and H. Cheng, “Approximate closest community search in networks,” PVLDB, vol. 9, no. 4, pp. 276–287, 2015.
  • [4] R. Li, L. Qin, J. X. Yu, and R. Mao, “Influential community search in large networks,” PVLDB, vol. 8, no. 5, pp. 509–520, 2015.
  • [5] Y. Fang, R. Cheng, S. Luo, and J. Hu, “Effective community search for large attributed graphs,” PVLDB, vol. 9, no. 12, pp. 1233–1244, 2016.
  • [6] Y. Fang, R. Cheng, X. Li, S. Luo, and J. Hu, “Effective community search over large spatial graphs.” PVLDB, vol. 10, no. 6, pp. 709–720, 2017.
  • [7] X. Huang and L. V. Lakshmanan, “Attribute-driven community search,” PVLDB, vol. 10, no. 9, pp. 949–960, 2017.
  • [8] R. Li, L. Qin, F. Ye, J. X. Yu, X. Xiao, N. Xiao, and Z. Zheng, “Skyline community search in multi-valued networks,” in SIGMOD, 2018, pp. 457–472.
  • [9] T. Joachims, “Optimizing search engines using clickthrough data,” in SIGKDD, 2002, pp. 133–142.
  • [10] B. Jiang, J. Pei, X. Lin, D. W. Cheung, and J. Han, “Mining preferences from superior and inferior examples,” in SIGKDD, 2008, pp. 390–398.
  • [11] L. Qian, J. Gao, and H. Jagadish, “Learning user preferences by adaptive pairwise comparison,” PVLDB, vol. 8, no. 11, pp. 1322–1333, 2015.
  • [12] S. B. Seidman, “Network structure and minimum degree,” Social networks, vol. 5, no. 3, pp. 269–287, 1983.
  • [13] K. Mouratidis and B. Tang, “Exact processing of uncertain top-k queries in multi-criteria settings,” PVLDB, vol. 11, no. 8, pp. 866–879, 2018.
  • [14] V. Batagelj and M. Zaversnik, “An o(m) algorithm for cores decomposition of networks,” CoRR, cs.DS/0310049, 2003.
  • [15] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang, “Online search of overlapping communities,” in SIGMOD, 2013, pp. 277–288.
  • [16] F. Zhang, Y. Zhang, L. Qin, W. Zhang, and X. Lin, “When engagement meets similarity: efficient (k, r)-core computation on social networks,” PVLDB, vol. 10, no. 10, pp. 998–1009, 2017.
  • [17] Z. Zhang, X. Huang, J. Xu, B. Choi, and Z. Shang, “Keyword-centric community search,” in ICDE, 2019, pp. 422–433.
  • [18] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984, pp. 47–57.
  • [19] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and J. R. Smith, “The onion technique: indexing for linear optimization queries,” in SIGMOD, 2000, pp. 391–402.
  • [20] B. Tang, K. Mouratidis, and M. L. Yiu, “Determining the impact regions of competing options in preference space,” in SIGMOD, 2017, pp. 805–820.
  • [21] S. Borzsony, D. Kossmann, and K. Stocker, “The skyline operator,” in ICDE, 2001, pp. 421–430.
  • [22] J. Liu, L. Xiong, J. Pei, J. Luo, and H. Zhang, “Finding pareto optimal groups: Group-based skyline,” PVLDB, vol. 8, no. 13, pp. 2086–2097, 2015.
  • [23] I. F. Ilyas, G. Beskales, and M. A. Soliman, “A survey of top-k query processing techniques in relational database systems,” ACM Comp. Surveys, vol. 40, no. 4, pp. 1–58, 2008.
  • [24] R. Zhong, G. Li, K.-L. Tan, L. Zhou, and Z. Gong, “G-tree: An efficient and scalable index for spatial search on road networks,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 8, pp. 2175–2189, 2015.
  • [25] Z. Li, L. Chen, and Y. Wang, “G*-tree: An efficient spatial index on road networks,” in ICDE, 2019, pp. 268–279.
  • [26] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Trans. Database Syst., vol. 30, no. 1, pp. 41–82, 2005.
  • [27] M. D. Berg, O. Cheong, M. V. Kreveld, and M. Overmars, Computational geometry: algorithms and applications.   Springer, 2008.
  • [28] L. Zou and L. Chen, “Pareto-based dominant graph: An efficient indexing structure to answer top-k queries,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 5, pp. 727–741, 2011.
  • [29] P. K. Agarwal and M. Sharir, “Arrangements and their applications,” in Handbook of computational geometry.   Elsevier, 2000, pp. 49–119.
  • [30] J. Cheng, Y. Ke, A. W.-C. Fu, J. X. Yu, and L. Zhu, “Finding maximal cliques in massive networks,” ACM Trans. Database Syst., vol. 36, no. 4, pp. 21:1–21:34, 2011.
  • [31] R. Zhou, C. Liu, J. X. Yu, W. Liang, B. Chen, and J. Li, “Finding maximal k-edge-connected subgraphs from a large graph,” in EDBT, 2012, pp. 480–491.
  • [32] T. Akiba, Y. Iwata, and Y. Yoshida, “Linear-time enumeration of maximal k-edge-connected subgraphs in large networks by random contraction,” in CIKM, 2013, pp. 909–918.
  • [33] L. Chang, J. X. Yu, L. Qin, X. Lin, C. Liu, and W. Liang, “Efficiently computing k-edge connected components via graph decomposition,” in SIGMOD, 2013, pp. 205–216.
  • [34] L. Qin, R. Li, L. Chang, and C. Zhang, “Locally densest subgraph discovery,” in SIGKDD, 2015, pp. 965–974.
  • [35] Y. Wu, R. Jin, J. Li, and X. Zhang, “Robust local community detection: on free rider effect and its elimination,” PVLDB, vol. 8, no. 7, pp. 798–809, 2015.
  • [36] L. Chen, C. Liu, R. Zhou, J. Li, X. Yang, and B. Wang, “Maximum co-located community search in large scale social networks,” PVLDB, vol. 11, no. 10, pp. 1233–1246, 2018.
  • [37] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries over road-social networks,” in ICDE, 2019, pp. 434–445.
  • [38] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries on road-social networks under multi-criteria,” IEEE Trans. Knowl. Data Eng., 2020.
  • [39] L. Chen, C. Liu, K. Liao, J. Li, and R. Zhou, “Contextual community search over large social networks,” in ICDE, 2019, pp. 88–99.
  • [40] Q. Liu, Y. Zhu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “Vac: vertex-centric attributed community search,” in ICDE, 2020, pp. 937–948.
  • [41] J. Luo, X. Cao, X. Xie, Q. Qu, Z. Xu, and C. S. Jensen, “Efficient attribute-constrained co-located community search,” in ICDE, 2020, pp. 1201–1212.
  • [42] A. H. Land and A. G. Doig, “An automatic method of solving discrete programming problems,” Econometrica, vol. 28, no. 3, pp. 497–520, 1960.