Multi-attributed Community Search in Road-social Networks
Abstract
Given a location-based social network, how to find the communities that are highly relevant to query users and have top overall scores in multiple attributes according to user preferences? Typically, in the face of such a problem setting, we can model the network as a multi-attributed road-social network, in which each user is linked with location information and () numerical attributes. In practice, user preferences (i.e., weights) are usually inherently uncertain and can only be estimated with bounded accuracy, because a human user is not able to designate exact values with absolute precision. Inspired by this, we introduce a normative community model suitable for multi-criteria decision making, called multi-attributed community (MAC), based on the concepts of -core and a novel dominance relationship specific to preferences. Given uncertain user preferences, namely, an approximate representation of weights, the MAC search reports the exact communities for each of the possible weight settings. We devise an elegant index structure to maintain the dominance relationships, based on which two algorithms are developed to efficiently compute the top- MACs. The efficiency and scalability of our algorithms and the effectiveness of MAC model are demonstrated by extensive experiments on both real-world and synthetic road-social networks.
I Introduction
Numerous real-world networks (e.g., social networks) are made up of community structures, where discovering them is an essential problem in network analysis. Recently, community search, which is a kind of query-dependent community discovery problem and is designed to find densely connected subgraphs containing query vertices, has drawn much attention among database professionals due to an ever-growing number of applications [1, 2, 3, 4, 5, 6, 7, 8].
At the same time, with the prevalence of GPS-enabled mobile devices, location-based social networks (LBSN) are becoming more diverse and complex in recent years (e.g., Facebook Places, Foursquare). Since not only users and friendships are included, but each user is often associated with various properties, such as location information and numerical attributes. The location information is able to bridge the gap between virtual and physical worlds, while the numerical attributes obtained from user profiles or statistical information derived by various network analytics (e.g., influence, similarity, etc.) can characterize the user. For instance, in a scientific collaboration network such as Aminer, every author may have own (spatial) position/address and several numerical attributes (e.g., h-index, #publications, activeness, diverseness, etc.). Typically, we can model such network data as a multi-attributed road-social network, in which each vertex is linked with location and () numerical attributes.
Given a multi-attributed road-social network, how to identify the query-user-involved communities that are not dominated by the other communities according to user’s preferences for numerical attributes? For instance, considering h-index and activeness in the Aminer network, how to find a group of collaborators who are related to and close to the query users and take the activeness in their research fields in recent years as the main criterion? Similarly, considering #publications and diverseness, how to find a community of query users in the Aminer network such that the members provide a good tradeoff in terms of the number of publications and diverse research interests? It is noting that both the social and spatial cohesiveness of groups are incorporated.
In this regard, we want to develop a normative community model suitable for multi-criteria decision making. Considering that the traditional top- query receives a dataset of records with -dimensional attributes and a weight vector of values assigning the relative importance of each dimension to the user as input, where the weighted sum of attribute values is used as the score of a record, we evolve such scoring method into a multi-attributed community model. Similarly, the weight vector denotes the user preferences and is therefore crucial in generating useful recommendations for the community. Generally, can be directly input by the user or mined from his/her past behavior or choices [9, 10].
Driven by the fact that weight accuracy plays a vital role in the applicability and practicality of top- queries, we argue that the assumption that exact values of a weight vector are known is inherently inaccurate and almost unrealistic. To illustrate this, consider the case of manually assigning weights to the above example. By taking activeness as the main criterion, a user may specify weights 0.2 and 0.8 for h-index and activeness, respectively. At this point, leaving aside the participation of query users and spatial cohesiveness, the weighted sum of these two attribute values can be regarded as the influence (or importance) of the vertex in [4]. On the other hand, based on pure intuition, she may also specify the same weight per dimension (e.g., 0.5 each). The results may be similar to the skyline community [8], since weights are not biased towards any dimension (as verified in our experiments, the skyline community [8] is usually contained in our results). However, it is impractical and unfair for the user to require weights with absolute precision even a slight variation in the weight (e.g., from 0.2 to 0.19; see Example 3) may remarkably change the results. On the contrary, it would be preferable to take user inputs as generic instructions and leave room for inaccuracies in weight setting. Similarly, for the weight vector computed by preference learning techniques, it should serve only as a rough guide instead of an accurate expression of user preferences. Naturally, this issue can be dealt with by expanding the weight vector to a region111There are already preference learning techniques (e.g., [11]) to generate such a region instead of a specific weight vector. and returning all promising results to the user, thus providing a practical and more user-centric design.
Prior work on community search problem has never incorporated the uncertainty of weight vector, thus all existing community search algorithms are unable to answer the above questions. To adequately characterize such interesting communities w.r.t. user preferences, we propose a novel community model called multi-attributed community (MAC) based on the concepts of -core [12] and r-dominance (variant of traditional sense) [13]. An MAC is a maximal connected -core with query vertices contained and spatial cohesiveness satisfied, that is not r-dominated by other connected -cores in terms of -dimensional attributes w.r.t a region of interest (see Section II for detailed definition). Importantly, since the scores of communities all depend on , they are necessarily correlated and vary together as freely lies inside region . As a result, a partitioning of forms the output, in which each partition is associated with the MACs when falls anywhere in that partition. The MAC model is also applicable to many interesting applications, some of which are introduced below.
Personalized optimum community search. In daily life, people always want to discover the optimum community based on different needs. For example, a coach hopes to reorganize the school basketball team around certain players (as query users) to improve offense. In this application, we can limit the query scope to the school and extract three numerical attributes for each player: points, rebounds and assists. By setting the region of user preferences, we can obtain corresponding communities that are not r-dominated by the others. Similar query may also help organizations to analyze customer orientation or perform marketing/promotion activities.
Cohesive groups discovery in LBSNs. In some cases, one may wish to circle the target range by finding cohesive groups to achieve identification, such as COVID-19 precaution and suspect investigation. Given several confirmed cases, possible cases are likely to be within a certain range of them (providing possibility of close contact), and the Jaccard similarity (e.g., hobbies, interests) and influence (e.g., #neighbors) for each user can be extracted as numerical attributes; for the preliminary investigation of a case, given the victim and escape range, motive and #criminal records of the same or different types can be used as numerical attributes. By mining the MACs, we can get the desired cohesive groups related to query users.
Contributions. Efficient solutions are formulated and provided to find multi-attributed communities in road-social networks. Below, we summarize the contributions of this paper.
Novel community model. The MAC model is proposed in road-social networks, which can be used for finding communities not r-dominated by others. To the best of our knowledge, our model is the first to incorporate uncertainty of weight vectors, and our work is also the first to introduce the dominance relationship specific to preferences for community modeling.
New algorithms. An efficient global search, DFS-based algorithm, is first developed to find the top- MACs for each of the possible weight settings in user preferences. The time and space complexity of DFS-based algorithm are bounded by and respectively, where and denote the number of vertices and edges in the maximal -core of a road-social network. To further accelerate the search, we propose a more efficient local search framework, of which two striking features are that (1) its time complexity is much lower than that of global search (at least one order of magnitude faster in practice), and it can find all non-contained MACs in most cases; and (2) it enables the MACs to be output progressively during execution, which is very beneficial to applications expecting only part of the MACs.
Extensive experiments. To demonstrate the high efficiency and scalability of our proposed algorithms and to evaluate the effectiveness of the MAC model, we conduct extensive experiments and a comprehensive case study on both real-world and synthetic road-social networks, in which many significant and interesting communities are able to be discovered.
II Problem Statement
In this section, we formally introduce the multi-attributed community in road-social networks and its search problems. Table I summarizes the notations used throughout this paper.
II-A Preliminaries
Road network. We model a road network as an undirected weighted graph , where (resp. ) is the sets of vertices (resp. edges). A vertex represents a road intersection/end. An edge represents a road segment allowed to travel between vertices and , and is associated with a non-negative weight that represents its cost (e.g., distance/travel time). Let be a spatial point lying on edge , and be proportional to the length from to . By we refer to the network distance between points and in , which is the sum of edge weights along the least costly (i.e., shortest) path from to .


Social network. We model a social network with numerical attributes as an undirected graph , where and denote the sets of vertices (i.e., users) and edges (i.e., social relations) respectively, and are the sets of mappings and -dimensional vectors defined on respectively. Specifically, for each vertex , provides a mapping of each user’s location (i.e., spatial point) in the road network; and is also associated with a -dimensional vector with real values, denoted by , where and .
Road-social network. A road-social network is a pair of graphs, denoted by , where is a road network and is a social network. Each user is associated with a spatial point in , i.e., , indicating the current location of . We assume that can either be on a vertex or edge of .
Notation | Description |
---|---|
query vertices and number of MACs to select among | |
coreness threshold and query distance threshold | |
dimensionality of attributes and preference domain | |
degree of vertex in and minimum degree of | |
neighbor set of vertex in subgraph | |
mapping of user ’s location (i.e., spatial point) in | |
-dimensional vector with real values of user | |
query distance and score of subgraph resp. | |
the maximal -core and r-dominance graph | |
subgraphs of induced by and resp. | |
vertices in the bottom layer of and top layer of |
For example, Fig. 1 displays a road-social network, where vertices represent users and road junctions, and edges represent social relations and road segments, respectively. For simplicity, the location of user in Fig. 1(a) is on the vertex of the road network in Fig. 1(b). Fig. 2(a) shows the values of part of vertices in three different dimensions.
II-B (k, t)-Core
We introduce a novel densely connected substructure, called -core, by focusing on the following structural cohesiveness and communication cost.
Structural cohesiveness. In order to model the structural cohesiveness, the generally used -core model is adopted to indicate the communities [12, 14, 1, 2, 4]. In particular, by we refer to the degree of vertex in social network . Let be an induced subgraph of , where , , and .
Definition 1
(-core.) Given a graph and an integer , is a -core of if each vertex has a degree at least , i.e., .
The maximal -core is the one satisfying that no super -core containing it exists. We refer to the maximal in all -cores containing vertex as the core number of . In order to avoid confusion, we denote a connected -core as a -, since the maximal -core is not necessarily connected.
Remarks. Although we use -core as the structural cohesiveness metric, our techniques can also be applied to other criteria such as -clique [15] and -truss [3].
Communication cost. For two users , the length of the shortest path between their locations in is denoted by , which is equal to if and are not connected. To model the communication cost in , we utilize the notion of query distance below.
Definition 2
(Query Distance.) Given a graph and query vertices , , the query distance of is the maximum length of the shortest path from to in , denoted by ; the query distance of is defined as .
By query distance , the communication cost between query vertices and the members in can be measured. In general, a good community is considered to own a low communication cost, i.e., small . Consider the query vertices in Fig. 1. The query distance of is . The query distance of the subgraph induced by is equal to . In the following, we propose a new notion of -core, by adapting the concepts of -core and query distance, to capture dense structural cohesiveness and low communication cost.
Definition 3
(-core.) Given graphs , query vertices , and numbers and , is a -core iff is a - of containing and in .
For a -core, its structural cohesiveness increases with , while proximity to query vertices decreases with . For instance, in Fig. 1, the subgraph induced by for is a -core with and .
II-C Score of Multi-Attributes
Note that generalizing the existing community models, most of which are only for -dimensional attribute such as influence [4], Euclidean distance [6] or keyword similarity [5, 7, 16, 17], to a comparable one with multi-dimensional attributes is nontrivial. Unlike the above, comparing two communities becomes quite tough if either can have () values due to different dimensions. On the other hand, existing multi-dimensional model [8] cannot compare the pros and cons of any skyline communities and make trade-offs based on user preferences. To overcome the above issues, we introduce the r-dominance relationship between two communities, that will be developed for defining our multi-attributed community model.
As each vertex associated with a vector , the attributes define a -dimensional data domain. We assume that for each attribute a higher value is preferable, and a spatial index (e.g., R-tree [18]) is used to organize the vector set .
Referring to the traditional top- queries, the score of a vertex w.r.t can also be derived by inputting a weight vector as
(1) |
Thus, the top- results consist of the vertices with the highest scores. We assume w.l.o.g. that for and . Such conditions cannot restrain user preferences, because score ranking depends only on the direction of instead of its magnitude [19]; but they allow to drop one weight (i.e., ), thereby mapping the domain of to a ()-dimensional space, named the preference domain. This dimensionality reduction is critical [20], since the dimensionality directly determines the processing time of the costliest operations in our techniques (see Section V-B). In the following, refers to the ()-dimensional form of the weight vector. For example, given weights 0.2 and 0.3 for and respectively in Fig. 2(a), we have , and .


Community score. Let be an induced subgraph of . Given a weight vector , we define score of as
(2) |
Here, we have a brief discussion on why the “min” operator in Eq. 2 is used to define . The intention is the same as that in [4]. By using it, all the members in can be ensured to have a score in terms of dimensions no less than . In other words, if is large, vertices in may not have large values in each dimension but the weighted sum of their attribute values must be large. For instance, consider the previous case of taking the activeness as the main criterion (e.g., weight 0.8) in the collaboration network. Obviously, we can apply the defined above to quantify the overall quality focusing on the activeness of a group of collaborators.
Definition 4
(r-dominance.) Given a region in the preference domain, a community r-dominates another community when for any weight vector in , denoted by ; and denotes that vertex r-dominates another vertex .
Intuitively, in the traditional sense of dominance [21, 22], for any weight vector the dominator is always superior to the dominee, while r-dominance is specific to weight vectors in . Although a community may not dominate another based on the skyline community model [8], it might always have a higher score as is bounded in . For ease of representation, we assume that is a hyper-rectangle parallel to axes, but our techniques is directly applicable to general convex polytopes.
For example, Fig. 2(b) illustrates a preference domain (i.e., the domain of weight vector) with , where axis (resp. ) corresponds to the weight for (resp. ) on a scale of 0 to 1, and the region is a convex polygon (i.e., an axis-parallel rectangle ) representing the expanded or approximated user preferences.
II-D The MAC Problem
Combining the notion of -core and community score, we define the multi-attributed community (MAC) as follows.
Definition 5
(MAC.) Given graphs , query vertices , two numbers and , and a region of interest , is a multi-attributed community in the road-social network, if satisfies the following conditions:
-
1.
is a -core containing .
-
2.
There does not exist an induced subgraph such that is a -core and .
In terms of structural cohesiveness and communication cost, condition (1) requires not only that the community with query vertices contained is densely connected, but also that each vertex is spatially close to . In terms of maximality and multi-attributed community score, condition (2) ensures that the community is maximal and having the greatest score. The following example illustrates Definition 5.
Example 1
Consider , numerical attributes and a region in Fig. 1 and 2, respectively. Suppose, for instance, that , and . By Definition 5, the subgraph induced by is an MAC with community score equal to if freely lies anywhere in the upper-left part of (divided by dotted line), as it meets both conditions. Note that the subgraph induced by is not an MAC. This is because it is contained in the MAC induced by with the same community score, thus fails to satisfy condition (2).
For most practical applications, we typically tend to focus on the query-user-involved communities, which score higher than (i.e., r-dominate) all other communities in the preference domain . In this paper, we aim to efficiently discover such communities in road-social networks. Below, two multi-attributed community search problems are formulated.
Problem 1. Given graphs , query vertices , two numbers and , and a region of interest , the problem is to find the top- MACs with the highest community score for each possible weight vector in . Although the possible weight vectors are infinite in , a partitioning of can form the output, in which each partition is associated with the top- MACs when falls anywhere in that partition.
For Problem 1, an MAC may be contained in another MAC in the top- results.
Example 2
Assume that , and in Fig. 1. The top- MACs are subgraphs and induced by and for any weight vector in , respectively. , and or on either side of the dotted line. Clearly, contains .
To eliminate the containment relations in the top- results, we study another problem of finding the non-contained MAC.
Definition 6
(non-contained MAC.) An MAC is a non-contained MAC if there does not exist an induced subgraph such that is a -core and .
The following example illustrates Definition 6.
Example 3
Problem 2. Given graphs , query vertices , two numbers and , and a region of interest , the problem is to find the non-contained MAC for every possible weight vector in its corresponding partitioning of .
Discussions. Another three possible operators “max”, “sum” and “avg” are not appropriate for community score. The first two are monotonic w.r.t. the size of community, that is, a community scores higher than its sub-communities. Hence, the answer is always the maximal -core, which is independent of numerical attributes and region . The last one may cause outliers in the answer, e.g., only a few vertices score very high while the rest score low, resulting in a higher community score. Obviously, this is not an ideal community.
Challenges. Solving the above two problems faces three major challenges. First, the number of -s containing in a multi-attributed network can be exponentially large (even regardless of the query distance in ). Thus, enumerating all the -s to identify the MACs is intractable. Second, unlike traditional top- queries [23], the score of a community may vary greatly at different parts of , making it nontrivial to draw conclusions about r-dominance relationships between communities. Third, the MAC model enables a more flexible way to express user preferences in community search problem, which means that inherent inaccuracies in weight specification need to be taken into account. In consequence, without enumerating all the -cores, devising an efficient algorithm to detect the MACs is challenging. To overcome these challenges, we will develop efficient algorithms in the following sections.
III Warming Up for Our Methods
According to Definition 5, regardless of maximality and community score, the multi-attributed communities have to satisfy the constraints of structural cohesiveness and communication cost. Thus, we give two useful lemmas as follows.
Lemma 1
For a number , the vertices of with query distance greater than in cannot exist in any MAC.
Lemma 2
For an integer , the MACs must be contained in the maximal - containing .
The correctness of above lemmas can be verified by Definition 2 and the maximality of -core, resulting in Lemma 3.
Definition 7
(Maximal -core.) For graphs and query vertices , the maximal -core is a -core such that no super -core contains it, denoted by .
Lemma 3
For two numbers and , the MACs must be contained in the maximal -core.



Referring to Lemma 3, for a given and , we first filter out the vertices of that do not satisfy the query distance threshold by range query in , which can be accelerated by G-tree [24] or G*-tree [25], to obtain the induced connected subgraph by the remaining vertices. Next, we do -core decomposition [14] on the filtered social subgraph to compute the maximal - containing . It is noting that we employ the upper bound of coreness [2] before core decomposition. If is larger than , we immediately know there is no - w.r.t . So far, the maximal -core has been found such that the MACs can be obtained through in-depth computation. For example, the maximal -core, i.e., , for is the subgraph induced by , as shown in Fig. 4(a).
After abandoning the scheme of enumerating all the -cores whose number can be exponentially large even in , intuitively, we may think of iteratively deleting the smallest-score vertex w.r.t. -dimensional attributes until the resulting graph does not have a - containing . However, at this time we will face another problem, that is, which vertex has the smallest score? To address this issue, we design an effective data structure and construction algorithm to preserve r-dominance relationships between vertices.
IV R-Dominance Graph
In this section, we exploit the r-dominance graph to preserve pair-wise r-dominance relationships between vertices in , which will be used in our proposed search algorithms.
IV-A R-Dominance Test
Consider two vertices and where none dominates the other in terms of traditional dominance, that may not draw a reliable conclusion about which vertex ranks higher. Nevertheless, given a preference domain, the inequation (resp. equation ) corresponds to a half-space (resp. hyperplane), of which there are three different cases regarding the positioning against [13]. Specifically, in Fig. 3(a), r-dominates since half-space completely covers , which means scores higher for ; the case in Fig. 3(c) is symmetric. In Fig. 3(b), scores higher in one part of but lower in another, which is called r-incomparable as none r-dominates the other.
Clearly, the cases in Fig. 3(a) and 3(c) allow r-dominance conclusions to be safely drawn. In this way, we can determine r-dominance by detecting whether all polygon vertices defining fall into the half-space . If so (resp. not), r-dominates (resp. is r-dominated by) . Otherwise, and are r-incomparable. The inclusion detecting of each polygon vertex costs , so the r-dominance test requires in total, where is the number of polygon vertices defining .
IV-B Pair-Wise R-Dominance Relationship
The computation of r-dominance relationships is somewhat similar to that of the -skyband (i.e., BBS [26]), but differs as follows. (1) Rather than traditional dominance, we employ r-dominance and apply its test described in Section IV-A both for vertex-to-vertex and vertex-to-MBB (i.e., minimum bounding box) dominance testing. (2) Due to the fact that is bounded in , we adopt a unique sorting key for R-tree nodes (represented by MBB’s upper-right corner) and vertices in the heap to accelerate search convergence by leading it to r-dominate as many members as possible first. (3) We preserve the r-dominance relationships between vertices in instead of only the top- layers. It is noting that a max-heap is utilized in our adapted BBS and its sorting key is the score of R-tree node/vertex w.r.t. a pivot vector of , whose value of each dimension is the mean of the polygon vertices of in that dimension [13]. The correctness of our adaptation is guaranteed as follows. (1) The pivot vector must lie in due to ’s convexity [27]. (2) Vertices popped after cannot r-dominate due to pivot-based sorting (in decreasing order).
In addition, we adopt a directed acyclic graph (DAG) [28, 22] to maintain all pair-wise r-dominance relationships between vertices in , named r-dominance graph (denoted by ). Fig. 4(b) illustrates of . An arc from vertex to signifies that r-dominates . It is noting that an arc from or to is not needed as the transitivity of r-dominance relationship already implies this. The number of vertices that r-dominate is called ’s r-dominance count.


V Global Search
In this section, we develop a global search algorithm for Problem 2 and discuss its generalization for Problem 1. Before proceeding further, three useful lemmas are given as follows.
Lemma 4
The maximal -core, i.e., , is an MAC.
Lemma 5
For any MAC, if we delete the smallest-score vertex w.r.t. any weight in and the resulting subgraph still has a - containing , is an MAC.
Lemma 6
For any MAC , if we delete the smallest-score vertex w.r.t. any weight in but the resulting subgraph does not have a - containing , is a non-contained MAC.
In view of the above lemmas, we can devise an efficient algorithm based on depth-first search (DFS) for our problems.
V-A The DFS-based Algorithm
The idea of the DFS-based algorithm is described in detail in Algorithm 1. First, for given and , we compute the maximal -core, i.e., , and build the r-dominance graph . Then, the following procedure is iteratively invoked until the resulting graph in each partition of does not have a - containing . The procedure consists of two steps. Let and be the resulting subgraph and corresponding r-dominance graph in partition of (Line 6). The first step is to insert sub-partitions into according to (Lines 7-9), then find the smallest-score vertex in each sub-partition (Lines 10-11). In Line 9, note that is the root node of a binary tree after being passed in as a parameter, and is a set of leaf nodes of the binary tree, representing sub-partitions of . This step is essential and will be elaborated in Section V-B. The second step is to delete all the vertices that are definitely excluded in subsequent MACs, which enables and to be updated accordingly (Lines 15-20).
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/5f8eb5f1-aa05-4d78-a3bf-d6c54aea22d9/x10.png)
In particular, Algorithm 1 recursively deletes all the vertices violating the structural cohesiveness constraint by a DFS procedure (Lines 16-19) as long as the smallest-score vertex is found in the sub-partition. Because the degrees of the adjacent vertices of all reduce by 1 when we delete . This may cause some neighbors of to violate the structural cohesiveness constraint and thus cannot be contained in subsequent MACs. Likewise, we also need to verify whether the neighbors of other hops (e.g., 2-hop, 3-hop, etc.) meet the structural cohesiveness constraint. Obviously, the DFS procedure can be used to identify and delete all these vertices.
According to Lemma 6, we have a corollary as follows.
Corollary 1
Given an MAC , is a non-contained MAC if the smallest-score vertex meets one of the following conditions: (1) ; and (2) deleting will recursively disconnect (e.g., being deleted) or make the degree of remaining vertices less than .
Note that we always consider the early termination conditions of Corollary 1 in conjunction with the DFS procedure. Once either is met, it means that vertex cannot be deleted, even if is currently the one with the smallest score but is already the non-contained MAC w.r.t. partition (Line 12). As a result, if Corollary 1 (i.e., Lemma 6) holds, the top- MACs can be obtained by the union of top vertices in heap (totally backtracking times) and the last subgraph (Line 13). Based on Lemma 4 and 5, we can easily verify that for each partition recursively obtained in Line 6 is an MAC. Thus, Theorem 1 shows the correctness of Algorithm 1.
Theorem 1
Algorithm 1 correctly finds the top- MACs.
Proof Sketch: For any partition , as long as its current subgraph does not hold Corollary 1, will be divided into sub-partitions by hyperplanes (Line 10 in Algorithm 1), e.g., for . Here, is discarded when the recursion proceeds to the promising sub-partitions of the local arrangement, i.e., . Assume that is the smallest-score vertex in w.r.t. , the resulting subgraph, denoted by , obtained by invoking DFS procedure (Line 12 in Algorithm 1) can be claimed as the maximal - of the subgraph by contradiction, because DFS procedure recursively deletes all the vertices in whose degrees are smaller than . Therefore, we have and w.r.t. for . In this way, the non-contained MAC corresponding to each final sub-partition can be found, as well as all the MACs. Thus, we conclude that Algorithm 1 correctly finds the top- MACs.
We analyze the complexity of Algorithm 1 in Theorem 2.
Theorem 2
The time complexity and space complexity of Algorithm 1 are bounded by and respectively, where and denote the number of vertices and edges in .
Proof:
The key factor determining Algorithm 1’s time complexity is the construction of arrangements. In the worst case, vertices in are r-incomparable to each other, i.e., the complete arrangement of half-spaces needs to be constructed, in time [29]. The algorithm only needs to store , and maintains the heap and half-space information related to , which uses less than space complexity even in the worst case. ∎


V-B Arrangement Jointing and Indexing
In Algorithm 1, to find the smallest-score vertex for any weight vector in partition , we consider the vertices of in a bottom-up manner. In other words, leaf vertices of the r-dominance graph will be preferred (Line 7). The reason is that if a vertex is deleted either because it does not satisfy the structural cohesiveness constraint (already considered in the DFS procedure), or because it is the smallest-score vertex, but before this, all vertices it r-dominates should be deleted first.
The verification of a leaf vertex in entails partitioning by half-spaces , each corresponding to one of the remaining leaf vertex . Formally, an arrangement bounded by is defined by the supporting hyperplanes of these half-spaces, where each cell (i.e., sub-partition) is located in a set of half-spaces. The leaf vertices corresponding to these half-spaces are precisely those with scores higher than if falls in that cell, which means is the smallest-score vertex.
Consider in Fig. 4(b). Initially, the leaf vertices are , and (, Line 6 in Algorithm 1). Their respective half-spaces , and are inserted into the arrangement, as shown in Fig. 5(a). The vertex in brackets for each partition indicates the smallest-score vertex for any weight vector in that partition. For partitions and on the right, vertex will also be deleted by the DFS procedure due to the deletion of , after which and are both pushed into the heap w.r.t. the partitions (representing the vertices ignored). Thus, the resulting subgraph induced by is the corresponding non-contained MAC since discarding any vertex will no longer satisfy the structural cohesiveness constraint. By backtracking the top vertices in once (i.e., and ), we can easily obtain the second-ranked MAC induced by in and (refer to in Fig. 2(b)).
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/5f8eb5f1-aa05-4d78-a3bf-d6c54aea22d9/x13.png)
When Corollary 1 does not hold, sub-partition will be pushed into queue to compute the non-contained MAC in depth. At this time, vertices ignored will be discarded to update resulting subgraph and corresponding , so that new leaf vertex can be designated in the next round of verification and the half-spaces against other leaf vertices are inserted into the local arrangement. Consider in Fig. 5(a), we refer to as the new leaf vertex after is deleted, i.e., and are induced by and . Then, new half-spaces and are inserted into a newly initialized local arrangement against partition , where three sub-partitions , and are produced, as shown in Fig. 5(b). As and are pushed into together, the non-contained MAC induced by is returned for each sub-partition. Likewise, same operation applies to partition . Eventually, the solution in Example 3 is obtained. Note that we can directly locate and for since no new half-space needs to be computed due to the same leaf vertices (in ) as . This drastically reduces repetitive computation in half-spaces, each of which is computed only once if necessary.
Specifically, for each local arrangement considered, an index is built by a recursive process Partition (Algorithm 2). Then, the index is discarded when all relevant half-spaces are inserted, leaving only the hopeful sub-partitions of the local arrangement (if any). Note that, for any index of (Line 9 in Algorithm 1), the total cost of inserting the -th hyperplane is [20]. In addition, optimization of arrangement indexing and maintenance is the same as described in [13].
VI Local Search
Although the efficiency of the global search algorithm is considerable for each query, it may need to explore the entire maximal -core, especially when query vertices are located at the upper layer of the r-dominance graph . In this section, we devise the local search algorithms for Problem 2 and investigate the generalization for Problem 1.
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/5f8eb5f1-aa05-4d78-a3bf-d6c54aea22d9/x14.png)
The intuition is that the non-contained MACs for are in the vicinity of . Thus, the entire should be unnecessary to involve during the search. Nonetheless, it is intractable to enumerate all the -s containing , whose number can be exponentially large w.r.t. size. Accordingly, we only immerse in finding the communities that are most likely to be candidates for non-contained MACs. Their validity and corresponding partitions in can be quickly verified by alone. This inspires us to develop a framework of more efficient local search (Algorithm 3). Specifically, Expand procedure finds candidates (i.e., ) by selecting the most promising vertex as we explore in the neighborhood of , and stops when each target community forms a -. Verify procedure provides guarantee of identifying all valid non-contained MACs w.r.t. from . Note that the time complexity of Algorithm 3 is bounded by (see Theorem 3 and 4), which is much lower than that of Algorithm 1 ( and are typically very small in practice). As verified in our experiments, local search is at least one order of magnitude faster than global search, and all non-contained MACs will be expanded by our candidate generation strategies in most cases.
In this way, two problems arise: (1) how do we guarantee that the target community can be a candidate for non-contained MACs (at least form a - containing ); and (2) how do we know whether the - is a valid non-contained MAC w.r.t. ? The former poses a great challenge of determining which vertex to choose and when to terminate expansion, and the latter requires an in-depth study of the characteristics of non-contained MACs. In the following, we present lemmas and algorithms for the local search strategy.
VI-A Candidate Generation
To be a candidate for non-contained MACs, the current community must be at least a - containing . It would be nice if the structural cohesiveness metric (i.e., -core) is “monotonic”, which means that the larger the community, the smaller its minimum degree. So once the minimum degree drops below the given coreness threshold , we can stop the search. Unfortunately, such a metric is not monotonic. Thus, the greatest challenge is to overcome non-monotonicity first, which motivates us to conduct community search only by exploring local neighborhood of .
Now for the minimum degree of a subgraph , denoted by , we make an in-depth analysis of its monotonicity. Consider the exploration starting from the query vertices, i.e., induced by . We add a vertex from at each step until a - is obtained, assuming that is a vertex sequence it adds. So let be induced by . In general, is a non-monotonic function of . More formally, is unnecessarily greater than . However, the order of vertices added to determines the monotonicity of . Interestingly, for any with , we can always find a sequence of added vertices such that is a non-decreasing function of .
Lemma 7
For any query vertices with in graph , there always exists an added vertex sequence of starting with such that .
Proof Sketch: It is consistent with proving that vertices can be removed one by one from until , just ensuring that each removal of does not increase the minimal degree of the remaining vertices. Otherwise, it occurs only when is currently one of the vertices with the minimal degree. The reason is that removing a vertex with non-minimal degree will only preserve or decrease the minimal degree.
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/5f8eb5f1-aa05-4d78-a3bf-d6c54aea22d9/x15.png)
Lemma 7 implies that there always exists an exploration order that leads monotonically to a community of - containing . This can be generated by a sequence of vertices added to starting from so that for each . Note that the existence of such an order is a necessary but insufficient condition for finding a valid community. To illustrate the insufficiency, consider and in Fig. 1(a). Any vertex sequence starting with cannot yield a valid solution, yet is greater than .
In Expand procedure (Algorithm 4), we explore from the vicinity of by BFS and generate candidate set . To converge the current community towards a candidate for non-contained MAC, we develop two intelligent candidate selection strategies according to Lemma 3, 6 and 7. The idea of improving candidate generation is to use priority queues such that the most promising vertex can be selected to rapidly generate a candidate. From the perspective of structural cohesiveness, the priority of a vertex can be defined as or , where . emphasizes the improvement of minimum degree for the next step, with or for any ; produces the fastest increase in average degree of so that the minimum degree will increase with ’s density growth. From the perspective of community score, the priority of can be defined as , where is a constant (maximum priority in ) and denotes the layer of in . drives community score higher by adding a vertex that r-dominates as many vertices as possible. To sum up, the priority is defined as
(3) |
where is a trade-off against , or
(4) |
VI-B Verification
The determinant of global search is that computing the local arrangement of all half-spaces among leaf vertices is a relatively expensive process (in time [29], where is the number of half-spaces). Instead, in Verify procedure (Algorithm 5), an empty arrangement in is initialized, into which half-spaces w.r.t. a carefully selected and therefore very small subset of vertices (i.e., competitors below) are inserted, expecting to securely confirm or disqualify candidate without considering all other vertices. But before this, we first give a corollary to filter out unpromising candidates from . Note that represents the r-dominance graph corresponding to , which is a subgraph of induced by , denoted by ; and represents the rest of , denoted by .
Corollary 2
A community can be discarded if one of the following conditions is met: (1) , is a non-leaf vertex in ; and (2) and , cannot be recursively deleted by deleting vertices in .
Proof Sketch: Vertices in must be deleted if holds.
In other words, either there exists a non-cross-layer arc in between a non-leaf vertex and a leaf vertex, or can be recursively deleted by the DFS procedure. We refer to the that holds Corollary 2 as a promising community.
Lemma 8
For any promising , is regarded as an anchor if still forms a - after a (non-) leaf vertex in is deleted.
Now we discuss the verification process by half-space insertion, in which it may further benefit from the r-dominance relationships stored in as follows.
Lemma 9
Consider , , and their half-space inserted into the arrangement. Assume that is a vertex r-dominated by , and is a partition in the arrangement not covered by . Thus is guaranteed to r-dominate in partition .
Proof:
First, from the definition of half-space , holds anywhere outside . Second, holds anywhere inside by the definition of r-dominance. Thus, we derive that for each partition of the arrangement outside , , that is, r-dominates in . ∎
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/5f8eb5f1-aa05-4d78-a3bf-d6c54aea22d9/x16.png)
Based on Lemma 9, competitors chosen are the vertices in the bottom layer (leaf vertices) of and in the top layer (with r-dominance count 0) of , denoted by and respectively. The fundamental is that intuitively such competitors are the strongest, which are most likely to assist in disqualifying invalid candidates w.r.t. in turn.
Corollary 3
For any promising , it is a valid non-contained MAC if a partition exists in such that all vertices in score higher than those in , and the corresponding conditions are met:
-
1.
If Lemma 8 holds, additionally, all anchors need to score higher than other leaf vertices in .
-
2.
, if can be recursively deleted by the DFS procedure starting from (namely, is bound), then is updated where is ignored and replaced by the vertices of its next layer in .
-
3.
, if and are bound to each other, then vertices in only need to score higher than or .
Proof:
According to Corollary 2, all vertices in have to be deleted when is a promising community. In other words, vertices in are either recursively deleted by deleting those in the lower layer of , or deleted individually due to their lower scores. First, we consider the latter case. As a sufficient and necessary condition, these vertices just need to score lower than those in ; that is, only vertices in score higher than those in . Then, we consider the former case. That is, if condition (2) holds, it means that the restriction on half-spaces between the competitors can be relaxed by the newly updated vertices in , since such a vertex will be deleted anyway. Similarly, if condition (3) holds, the restriction on half-spaces between the competitors can also be relaxed through such vertices, because deletion of one will also lead to deletion of the other. On this basis, becomes a non-contained MAC when Lemma 8 does not hold; otherwise, condition (1) has to be satisfied. This is because such an anchor can still be deleted as it is a non- leaf vertex in , i.e., possibly the current smallest-score vertex in . Putting it all together, we ensure the correctness of the partition corresponding to the non-contained MAC (if any in ). ∎
To illustrate Algorithm 5, let us reconsider Example 2. Assume that by Algorithm 4 we have three promising communities , and , where , and . For , and . As and are bound to each other in (condition (3) met), we only insert and into in Fig. 5(b), and choose the partitions covered by either of them. As a result, is a valid non-contained MAC for any weight vector of in Fig. 2(b). Similarly, is also valid w.r.t. (condition (2) met) but is invalid (condition (1) met) as its partition is outside .
Theorem 4
The time complexity of Algorithm 5 is , where and denote the number of candidates and the number of promising communities in respectively, and is the product of and . The space complexity is bounded by .
Finally, we can simply generalize local search for Problem 1. As the non-contained MAC with corresponding partition is known, we insert sub-partitions into according to (in a up-bottom manner) and add the highest-score vertex to . As long as the current contains an MAC, it will be output. The process terminates until all top- MACs in are acquired. Consider and its , half-spaces among vertices in (i.e., ) are inserted first into . Then is added to for and shown in Fig. 5(b). As r-dominates , we have the second MAC induced by in and . The same applies to and .
VII Experiments
Dataset | Vertices | Edges | |||
---|---|---|---|---|---|
San Francisco (SF) | 175K | 223K | 2.55 | 8 | - |
Florida (FL) | 1.1M | 1.4M | 2.53 | 12 | - |
Slashdot | 79K | 0.5M | 13 | 2,507 | 85 |
Delicious | 536K | 1.4M | 5 | 3,216 | 34 |
Lastfm | 1.2M | 4.5M | 7 | 5,150 | 71 |
Flixster | 2.5M | 7.9M | 6 | 1,474 | 69 |
Yelp | 3.6M | 9.0M | 5 | 10,433 | 129 |
Comprehensive experiments are conducted to evaluate the proposed model and four algorithms, named GS-T, GS-NC, LS-T and LS-NC respectively. GS-T and GS-NC (resp. LS-T and LS-NC) are global search algorithms (resp. local search algorithms) used to compute the top- MACs and the non-contained MACs, respectively. Note that either of the two candidate selection strategies in Section VI-A can be adopted in LS-T or LS-NC, and we just give the results by using Eq. 3 with and (results by using Eq. 4 are similar and omitted to save space). All algorithms were implemented in C++, and all experiments were conducted on an Ubuntu server with 2GHz Intel Xeon E7-4820 CPU and 1TB memory.
Datasets. We use five real-world social networks222http://networkrepository.com,333https://www.yelp.com/dataset and two road networks (SF444https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm/FL555http://www.dis.uniroma1.it/challenge9/index.shtml) in our experiments. Table II summarizes the statistics of datasets, of which , and denote the average degree, the maximal degree and the maximal core number, respectively. Note that numerical attributes are not contained in the first four original social networks2, for which we employ a widely used method in [21] to generate three different types of numerical attributes, i.e., independence, correlation and anti-correlation. Due to space limit, we report the results obtained from datasets with independent and real attributes. In addition, we map each user to a spatial point in the road network that matches the scale of his/her social network as follows: we project SF/FL into range in each dimension and generate normalized by drawing from a list of recent check-ins. Assume that is the current location of if it has the smallest Euclidean distance to in the projection space.
Parameter | Tested values |
---|---|
4, 8, 16, 32, 64 | |
(SF/FL) | 600/800, 800/1000, 1000/1200, 1200/1400, 1400/1600 |
2, 3, 4, 5, 6 | |
1, 4, 8, 16, 32 | |
5, 10, 20, 40, 60 | |
0.1%, 0.5%, 1%, 5%, 10% |
Parameters. We vary parameters: structural cohesiveness , query distance , dimensionality , number of query users , number of top MACs, and percentage of axis length (i.e., side-length of ). Table III shows the range of parameters and their default values (in bold). For each value of , we randomly select sets of query vertices, that satisfy and can ensure the existence of the maximal -core, from the -core of each social network. In each experiment, only one parameter varies and the rest remains at the default. Every reported measurement is the average of MAC searches for ten randomly generated axis-parallel hypercubes in the preference domain.
VII-A Performance Evaluation






























Exp-1: Varying . We evaluate the query processing time of all algorithms and the number of vertices of by varying . In each road-social network, we can see that local search performs better than global search, but the advantage becomes less obvious when increases. In the best case, e.g., , LS-T and LS-NC are more than one order of magnitude faster than GS-T and GS-NC in Fig. 7(a) and 9(a). Only when , global search is comparable to local search since size shrinks when is large (see Fig. 11(c)), resulting in a reduction in the time complexity of global search and in the number of promising vertices involved in local search. Although Delicious is larger than Slashdot, but algorithms run faster at and since contains fewer vertices (as in Table II). Note that when increases from to , local search is merely more than twice as fast, e.g., LS-NC takes s and s respectively in Fig. 7(a). This is because when is relatively small, candidate selection strategies tend to find more smaller candidates. However, algorithms run faster in Yelp than in Flixster, yet its size is larger. This will be explained in Exp-6. In short, across a wide range of local search is consistently better than global search.
Exp-2: Varying . We evaluate the query processing time of all algorithms by varying . For each road-social network, local search outperforms global search significantly, with advantage becoming more obvious as increases. For example, in Fig. 9(b), GS-NC takes s while LS-NC takes s for . Note that the results are obtained in the case of , thus generally LS-T and LS-NC are almost one order of magnitude faster than GS-T and GS-NC in terms of . Because when is large, more users are retained via range query accelerated by G-tree [24] or G*-tree [25]. This favors local search radiating outward from , equivalent to increasing the expansion radius.
Exp-3: Varying . By varying we study the query processing time and the memory overhead of all algorithms, and comparison of different methods. The toughness of MAC search rises with due to its computational geometric nature. Nonetheless, all four algorithms offer practical processing time, taking respectively s and s for in Fig. 9(c). Furthermore, Fig. 11(d) shows the memory overhead of GS-NC/LS-NC and BBS process ( indexed and built). When increases, dimension of R-tree increases but memory overhead changes not much due to unchanged size; local search is very lightweight against global search. For example, GS-NC takes MB and MB while LS-NC takes only MB and MB for and , respectively. The results confirm both the theoretical analysis and the claims on arrangement indexing in Section V-B. In addition, Fig. 13 and 14 show the comparison with methods in [4] and [8], where Influ (resp. Influ+) is the DFS-based (resp. ICP-index based) algorithm, and Sky (resp. Sky+) is the basic (resp. space-partition) algorithm. We implement Influ and Influ+ by varying instead of since they can only capture -dimensional attribute. For a fair comparison, weight vectors that fall anywhere in are randomly selected to respectively calculate the weighted sum of (at the default) numerical attributes as vertex influence (i.e., score), and the average processing time is reported in Fig. 13(a) and 14(a). Since no r-dominance graph needs to be maintained and no half-space has to be computed nor inserted, Influ and Influ+ are superior to GS-NC and LS-NC in terms of processing time, while Sky and Sky+ are generally the most expensive due to their time complexity. On the other hand, in terms of , Sky and Sky+ are much costlier than ours and intractable when is relatively large, e.g., and respectively in Fig. 14(b). Here, “Inf” means processing time exceeds s. Therefore, our model and algorithms are tractable and scalable to handle real-world applications comprehensively and flexibly.
Exp-4: Varying . We evaluate the query processing time of all algorithms and the ratio of non-contained MACs (NC-MACs) found by LS-NC to GS-NC by varying . All processing time almost monotonically decreases with the growth of since it accelerates the convergence of both global and local search. The reason is that increasing means more vertices cannot be deleted in global search and selecting fewer vertices may find a candidate in local search. Note that in Fig. 7(d), as , algorithms are not much faster at than . In fact, we also find that global search terminates early if the generated query vertices are located in the lower layer of because it is more likely to encounter when deleting vertices, while local search is just the opposite. Fig. 12(a) and 12(b) show the ratio against and , respectively. We can see that both the ratios decrease with the growth of and , but it is still satisfactory. The reason is that the number of community candidates expanded by the Expand procedure tends to decrease with the increase of or , but the Verify procedure can always ensure the correctness of the corresponding partitions in of any (promising) community. On the other hand, this proves that local search is very useful for applications expecting only part of MACs. Note that, in practice, is usually not very large and the ratio reaches at default , which confirms that local search can find all non-contained MACs in most cases.












Exp-5: Varying . We examine the query processing time of GS-T and LS-T by varying . The curve of GS-T is rising slowly with increasing since the top- MACs can be directly obtained after executing global search. But for LS-T, after obtaining the non-contained MAC, its corresponding cell has to be divided again to find the top- MACs, resulting in an increase in processing time.
Exp-6: Varying . We evaluate the query processing time of all algorithms, and the number of partitions and non-contained MACs by varying (determining the size of region ). As anticipated, a larger means a larger output, thereby more computations needed. In Fig. 11(a), we can see that the growth of will lead to a significant increase in the number of partitions in , which also explains the increase in processing time of all algorithms. Fig. 11(b) records the relationship between the number of non-contained MACs obtained by GS-NC and . Similarly, as the number of partitions increases, the diversity of non-contained MACs w.r.t. will also increase. Note that both bars of Yelp in Fig. 11(a) and 11(b) are much shorter than those of Flixster while its size is larger in Fig. 11(c). Since attributes not only in Yelp but in real world are usually correlated or more, fewer (even unique) branches in DAG result in less processing time, that is, less half-space computation and insertion.
VII-B Case Study
NA+Aminer. We apply the road network of North America4 (NA) and the Aminer (aminer.org) for the first case study. The Aminer is a scientific collaboration network that incorporates authors in DB, DM, IR, and ML fields, comprised of vertices and edges. For each author we crawl four numerical attributes: h-index, #publications, activeness, and diverseness. To evaluate the effectiveness of MAC model in real world, we map each author to the location in NA according to its affiliation and use “Jiawei Han”, “Jian Pei”, “Philip S. Yu”, “Xifeng Yan”, who are renowned scientists in DM (i.e., relatively high scores), as query vertices. After setting and (with large enough), Fig. 15(a-d) show the top- MACs anywhere in . Furthermore, we compare MAC with different models. For InfC [4], Fig. 15(f, g) report the results involving , respectively taking #publications only and weighted sum (by ) as influence. In fact, InfC either cannot capture the characteristics of all attributes, or must be covered by a non-contained MAC (NC-MAC) if (e.g., Fig. 15(c)). For SkyC [8], there are two results, one is the same as NC-MAC in Fig. 15(a), while the other shown in Fig. 15(e) only contains partial and is covered by NC-MAC in Fig. 15(c). In effect, we find that SkyC is always contained in NC-MACs due to no query vertices and its skyline property. For ATC [7], Fig. 15(h) reports the -truss w.r.t. and keyword “DM” as a -truss is a -core. Although communication cost is low (i.e., ), its size is still too large since it only considers maximum inclusion of keywords but ignores attributes. Therefore, the MAC model is very effective, comprehensive and flexible for applications.








SF+Yelp. We apply the SF and the Yelp3 for the second case study. In addition to profile information (e.g., ID, first name, etc.), users in Yelp also have real attribute data, such as average rating of all reviews, #reviews written, #hot compliments and so on. Note that the Yelp is more like an LBSN composed of many relatively small ego-like networks, in which ego-like users usually have more fans or followers, write more reviews and receive more compliments. That is, these ego-like users are highly active, showing high attribute values in each dimension; while ordinary users have very low attribute values in all dimensions (e.g., only browsing without posting). In fact, we find that in real world attributes are usually correlated or more, e.g., most users in Yelp have an attribute value of . Thus, the number of branches in DAG will be extremely small or even unique, resulting in less half-space computation/insertion and fewer partitions in . We map each user to the location in SF according to check-ins and use “Emi”, “Phil”, “Dani”, “Michelle”, who are relatively active, as query vertices. To discover a group of people who are more concerned and popular, #hot compliments, #more compliments and #photo compliments are used as three numerical attributes here. By setting and , Fig. 16 shows the top- MACs in . This fully illustrates that in the real world, the diversity of (non-contained) MACs and the number of corresponding partitions w.r.t. are very small and user-friendly.



VIII Related Work
Community model and search. A large number of community models have been proposed such as -core [12, 2], -truss [3], maximal clique [30], quasi-clique [15], maximal -edge connected subgraph [31, 32, 33], locally densest subgraph [34], query-biased density [35], etc. All these models consider only graph structural information but ignore numerical/textual attributes associated with vertices. To discover cohesive subgraphs containing the query vertices, a community search problem (CSP) was studied to find the maximal connected -core in social networks [1], for which [2] proposed a more efficient local search algorithm. Recently, [7] introduced the CSP of small-diameter -truss with similar query attributes. [5] and [6] developed the CSP of -core with textual attributes and smallest minimum covering circle, respectively. [4] proposed an influential community with vertex’s influence as one numerical attribute, based on which [8] studied a skyline community for -dimensional numerical attributes. More recently, [36] studied the CSP of -truss with distance of at most for any two vertices. [37] and [38] investigated a cohesive version of CSP that brings all community members closest to the point-of-interest in road networks. [39] and [17] studied two different CSPs in terms of context with only query keywords but no query vertices. In addition, [40] and [41] made variations on the CSPs in [7] and [36], respectively.
This work differs from all the prior work in the following. (1) Our multi-attributed community (MAC) model is the first one that can incorporate uncertainty of user preferences in the weight vector into -dimensional numerical attributes and capture spatial cohesiveness between users in road networks. (2) The preference domain is introduced into community modeling for the first time. (3) We study the novel MAC search problems in road-social networks such that our techniques are significantly different from all previous CSP algorithms.
Skyline and its generalization. The r-dominance graph used in our MAC model is relevant to the skyline [21] and more to its extension, the -skyband [26]. In traditional top- queries, if two records are inconsistent and one has no smaller value in any dimension [21], then it dominates the other. Thus, for a dataset the skyline consists of the records which are not dominated by any other; while the -skyband comprises those dominated by fewer than other records [26], indicated as a superset of all records which for any weight vector may occur in the top- results. As a typical -skyband computation algorithm, BBS [26] adopts a spatial index in the dataset, following the branch-and-bound paradigm [42].
This proves an additional essential difference between [8] and our work from another perspective. Regardless of social and spatial cohesiveness, dominance in [8] between two communities comes down to a series of standard dominance tests on -dimensional vectors. However, r-dominance tests adopted in our model do not suffice, e.g., two or more non-skyline communities may still collaboratively disqualify a skyline one if they score higher than it at different parts of , collectively blocking it from being a top- result anywhere in .
IX Conclusions
In this paper, we propose a novel community model to discover normative communities suitable for multi-criteria decision making in a road-social network, in which each user is linked with location information and () numerical attributes. Taking a preference region of -dimensional data domain as input, the resulting communities identified by our model cannot be r-dominated by other ones as long as the weight vector could fall anywhere in the region. We formalize the multi-attributed community search; distinguish two problem versions; develop solutions for corresponding processing; and using both real-world and synthetic datasets demonstrate the efficiency and scalability of our solutions and the effectiveness of our model.
Acknowledgment
This work is supported by NSFC (No. 61932004, 61732003, 61729201 and 62072087) and Fundamental Research Funds for the Central Universities (No. N181605012). Ye Yuan is the corresponding author.
References
- [1] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in SIGKDD, 2010, pp. 939–948.
- [2] W. Cui, Y. Xiao, H. Wang, and W. Wang, “Local search of communities in large graphs,” in SIGMOD, 2014, pp. 991–1002.
- [3] X. Huang, L. V. Lakshmanan, J. X. Yu, and H. Cheng, “Approximate closest community search in networks,” PVLDB, vol. 9, no. 4, pp. 276–287, 2015.
- [4] R. Li, L. Qin, J. X. Yu, and R. Mao, “Influential community search in large networks,” PVLDB, vol. 8, no. 5, pp. 509–520, 2015.
- [5] Y. Fang, R. Cheng, S. Luo, and J. Hu, “Effective community search for large attributed graphs,” PVLDB, vol. 9, no. 12, pp. 1233–1244, 2016.
- [6] Y. Fang, R. Cheng, X. Li, S. Luo, and J. Hu, “Effective community search over large spatial graphs.” PVLDB, vol. 10, no. 6, pp. 709–720, 2017.
- [7] X. Huang and L. V. Lakshmanan, “Attribute-driven community search,” PVLDB, vol. 10, no. 9, pp. 949–960, 2017.
- [8] R. Li, L. Qin, F. Ye, J. X. Yu, X. Xiao, N. Xiao, and Z. Zheng, “Skyline community search in multi-valued networks,” in SIGMOD, 2018, pp. 457–472.
- [9] T. Joachims, “Optimizing search engines using clickthrough data,” in SIGKDD, 2002, pp. 133–142.
- [10] B. Jiang, J. Pei, X. Lin, D. W. Cheung, and J. Han, “Mining preferences from superior and inferior examples,” in SIGKDD, 2008, pp. 390–398.
- [11] L. Qian, J. Gao, and H. Jagadish, “Learning user preferences by adaptive pairwise comparison,” PVLDB, vol. 8, no. 11, pp. 1322–1333, 2015.
- [12] S. B. Seidman, “Network structure and minimum degree,” Social networks, vol. 5, no. 3, pp. 269–287, 1983.
- [13] K. Mouratidis and B. Tang, “Exact processing of uncertain top-k queries in multi-criteria settings,” PVLDB, vol. 11, no. 8, pp. 866–879, 2018.
- [14] V. Batagelj and M. Zaversnik, “An o(m) algorithm for cores decomposition of networks,” CoRR, cs.DS/0310049, 2003.
- [15] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang, “Online search of overlapping communities,” in SIGMOD, 2013, pp. 277–288.
- [16] F. Zhang, Y. Zhang, L. Qin, W. Zhang, and X. Lin, “When engagement meets similarity: efficient (k, r)-core computation on social networks,” PVLDB, vol. 10, no. 10, pp. 998–1009, 2017.
- [17] Z. Zhang, X. Huang, J. Xu, B. Choi, and Z. Shang, “Keyword-centric community search,” in ICDE, 2019, pp. 422–433.
- [18] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984, pp. 47–57.
- [19] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and J. R. Smith, “The onion technique: indexing for linear optimization queries,” in SIGMOD, 2000, pp. 391–402.
- [20] B. Tang, K. Mouratidis, and M. L. Yiu, “Determining the impact regions of competing options in preference space,” in SIGMOD, 2017, pp. 805–820.
- [21] S. Borzsony, D. Kossmann, and K. Stocker, “The skyline operator,” in ICDE, 2001, pp. 421–430.
- [22] J. Liu, L. Xiong, J. Pei, J. Luo, and H. Zhang, “Finding pareto optimal groups: Group-based skyline,” PVLDB, vol. 8, no. 13, pp. 2086–2097, 2015.
- [23] I. F. Ilyas, G. Beskales, and M. A. Soliman, “A survey of top-k query processing techniques in relational database systems,” ACM Comp. Surveys, vol. 40, no. 4, pp. 1–58, 2008.
- [24] R. Zhong, G. Li, K.-L. Tan, L. Zhou, and Z. Gong, “G-tree: An efficient and scalable index for spatial search on road networks,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 8, pp. 2175–2189, 2015.
- [25] Z. Li, L. Chen, and Y. Wang, “G*-tree: An efficient spatial index on road networks,” in ICDE, 2019, pp. 268–279.
- [26] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Trans. Database Syst., vol. 30, no. 1, pp. 41–82, 2005.
- [27] M. D. Berg, O. Cheong, M. V. Kreveld, and M. Overmars, Computational geometry: algorithms and applications. Springer, 2008.
- [28] L. Zou and L. Chen, “Pareto-based dominant graph: An efficient indexing structure to answer top-k queries,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 5, pp. 727–741, 2011.
- [29] P. K. Agarwal and M. Sharir, “Arrangements and their applications,” in Handbook of computational geometry. Elsevier, 2000, pp. 49–119.
- [30] J. Cheng, Y. Ke, A. W.-C. Fu, J. X. Yu, and L. Zhu, “Finding maximal cliques in massive networks,” ACM Trans. Database Syst., vol. 36, no. 4, pp. 21:1–21:34, 2011.
- [31] R. Zhou, C. Liu, J. X. Yu, W. Liang, B. Chen, and J. Li, “Finding maximal k-edge-connected subgraphs from a large graph,” in EDBT, 2012, pp. 480–491.
- [32] T. Akiba, Y. Iwata, and Y. Yoshida, “Linear-time enumeration of maximal k-edge-connected subgraphs in large networks by random contraction,” in CIKM, 2013, pp. 909–918.
- [33] L. Chang, J. X. Yu, L. Qin, X. Lin, C. Liu, and W. Liang, “Efficiently computing k-edge connected components via graph decomposition,” in SIGMOD, 2013, pp. 205–216.
- [34] L. Qin, R. Li, L. Chang, and C. Zhang, “Locally densest subgraph discovery,” in SIGKDD, 2015, pp. 965–974.
- [35] Y. Wu, R. Jin, J. Li, and X. Zhang, “Robust local community detection: on free rider effect and its elimination,” PVLDB, vol. 8, no. 7, pp. 798–809, 2015.
- [36] L. Chen, C. Liu, R. Zhou, J. Li, X. Yang, and B. Wang, “Maximum co-located community search in large scale social networks,” PVLDB, vol. 11, no. 10, pp. 1233–1246, 2018.
- [37] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries over road-social networks,” in ICDE, 2019, pp. 434–445.
- [38] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries on road-social networks under multi-criteria,” IEEE Trans. Knowl. Data Eng., 2020.
- [39] L. Chen, C. Liu, K. Liao, J. Li, and R. Zhou, “Contextual community search over large social networks,” in ICDE, 2019, pp. 88–99.
- [40] Q. Liu, Y. Zhu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “Vac: vertex-centric attributed community search,” in ICDE, 2020, pp. 937–948.
- [41] J. Luo, X. Cao, X. Xie, Q. Qu, Z. Xu, and C. S. Jensen, “Efficient attribute-constrained co-located community search,” in ICDE, 2020, pp. 1201–1212.
- [42] A. H. Land and A. G. Doig, “An automatic method of solving discrete programming problems,” Econometrica, vol. 28, no. 3, pp. 497–520, 1960.