Finding Key Structures in MMORPG Graph with Hierarchical Graph Summarization
Abstract
What are the key structures existing in a large real-world MMORPG (Massively Multiplayer Online Role-Playing Game) graph? How can we compactly summarize an MMORPG graph with hierarchical node labels, considering consistent substructures at different levels of hierarchy? Recent MMORPGs generate complex interactions between entities, inducing a heterogeneous graph in which each entity has hierarchical labels. Succinctly summarizing a heterogeneous MMORPG graph is crucial to better understand its structure; however, it is a challenging task since it needs to handle complex interactions and hierarchical labels efficiently. Although there exist a few methods to summarize a large-scale graph, they do not handle heterogeneous graphs with hierarchical node labels.
We propose GSHL, a novel method that summarizes a heterogeneous graph with hierarchical labels. We formulate the encoding cost of hierarchical labels using MDL (Minimum Description Length). GSHL exploits the formulation to identify and segment subgraphs, and discovers compact and consistent structures in the graph. Experiments on a large real-world MMORPG graph with multi-million edges show that GSHL is a useful and scalable tool for summarizing the graph, finding important and interesting structures in the graph, and finding similar users.
keywords: Graph summarization, minimum description length, hierarchical label, massively multiplayer online role-playing game

1 Introduction
What does a large real-world MMORPG (Massively Multiplayer Online Role-Playing Game) graph look like? What are the key structures in such a graph? How can we compactly summarize an MMORPG graph with hierarchical node labels, considering consistent substructures at different levels of the hierarchy? A large number of interactions between heterogeneous entities appear in MMORPGs, and they are often represented as a heterogeneous graph. For example, a user has several characters, and each character, with various equipment, visits dungeons. Moreover, each entity has hierarchical labels (e.g., character-dealer-destroyer and equipment-weapon). Summarizing a heterogeneous graph with hierarchical node labels is a crucial problem for understanding its characteristics. A better understanding helps us identify important structures and obtain meaningful insights.
Several approaches have been proposed to summarize graphs using significant structures (vocabularies) commonly found in real-world graphs. VoG [1, 2] is the first approach to summarize a homogeneous graph with vocabulary terms by compressing the graph under the Minimum Description Length (MDL) principle. TimeCrunch [3] extends VoG to summarize dynamic graphs. Both methods, however, fail to summarize a heterogeneous graph with hierarchical labels since they consider neither node labels nor the label hierarchy.
We propose GSHL (Graph Summarization with Hierarchical Labels), a summarization method for a heterogeneous graph with hierarchical node labels. Based on the MDL principle that good compression leads to good summarization, we precisely define the encoding cost for heterogeneous graphs with hierarchical node labels. GSHL decomposes a heterogeneous graph into subgraphs using a graph decomposition method, and encodes each subgraph as a structure (e.g., star, clique, bipartite core, chain, etc.). Then GSHL further segments each structure by considering the hierarchical labels of the nodes in the structure, and summarizes the graph using the encoding cost. Using GSHL, we analyze a large real-world MMORPG graph and find its key structures as well as similar users. Our contributions are as follows:
Problem formulation. We formulate the problem of summarizing a heterogeneous graph with hierarchical labels. We use a large real-world graph extracted from Blade & Soul, a popular MMORPG played worldwide. To the best of our knowledge, this is the first work in the literature that summarizes a large real-world MMORPG graph.
Scalable Method. We propose GSHL, a novel algorithm to summarize a heterogeneous graph with hierarchical node labels. GSHL carefully exploits MDL to encode hierarchical labels, and generates a succinct summary by segmenting subgraphs at different levels of the hierarchy. GSHL scales near-linearly with the number of edges (see Fig. 9).
Experiment. Experiments on the real-world MMORPG graph reveal that GSHL is a useful tool for succinctly summarizing the graph (see Table 4). GSHL discovers interesting patterns (see Figs. 1 and 8), and finds similar users based on their structures (see Fig. 2).
In the rest of this paper, we describe data, MDL formulation, summary generation, experimental results, related works, and conclusion.





2 Data Description
Our target data is from Blade & Soul (https://www.bladeandsoul.com/en/), a popular MMORPG that is played in many countries worldwide and generates substantial revenue.
2.1 Blade & Soul
Blade & Soul entails multiple characters simultaneously interacting with each other in an open-world environment. Each character carries a variety of equipment acquired throughout its playtime; it enters various dungeons with other characters in order to acquire new equipment or items. Such game design encourages users to repetitively visit dungeons.
A user associated with an account can possess multiple in-game characters with different jobs. These jobs are divided into three types of play-styles: dealer, tanker, and buffer. A dealer mainly inflicts damage to the enemies of its party (a group of multiple characters), while a tanker taunts enemies so the party members can freely inflict damage, and a buffer aids other party members. A party needs to have balanced types of jobs to successfully enter a dungeon and fight with enemies.




2.2 Dataset
We collect three months of data, from January to March 2019, and extract the following four relationships.
• account - character: which character(s) each account possesses.
• account - account: friendship between accounts.
• character - dungeon: which dungeons each character visited.
• character - equipment: which equipment each character is equipped with.
We extract four types of entities: account, character, dungeon, and equipment. An account represents a user, and a user can have multiple characters. A dungeon entity belongs to a type based on its difficulty, where an advanced dungeon is a more difficult version of the corresponding normal dungeon.
2.2.1 Hierarchical Labels
Each label is further divided into sub-labels, leading to the hierarchical label structure summarized in Table 1. The four types of entities (account, character, dungeon, and equipment) extracted from the dataset represent level-1 of the hierarchy. The account type has no sub-labels. The character type has level-2 sub-labels based on play-styles, and each level-2 label of the character type has level-3 sub-labels. The dungeon type has level-2 sub-labels. The equipment type has level-2 sub-labels.
Table 1: Hierarchical node labels and node counts.
Level-1 | Level-2 | Level-3 | Node Count
Account | | |
Character | Dealer | Force Master |
Character | Dealer | Destroyer | 23,327
Character | Dealer | Summoner | 6,266
Character | Dealer | Blade Dancer | 5,822
Character | Dealer | Zen Archer | 11,023
Character | Tanker | Blade Master | 11,854
Character | Tanker | Kung Fu Master | 17,845
Character | Tanker | Warden | 6,689
Character | Buffer | Assassin | 18,460
Character | Buffer | Warlock | 15,868
Character | Buffer | Soul Fighter | 18,249
Dungeon | Normal | | 154
Dungeon | Advanced | | 12
Dungeon | Others | | 133
Equipment | Weapon | | 3,219
Equipment | Soul Shield | | 3,400
Equipment | Ring | | 431
Equipment | Bracelet | | 398
Equipment | Earring | | 455
Equipment | Belt | | 95
Equipment | Necklace | | 430
Equipment | Soul | | 171
Equipment | Heart | | 47
Equipment | Pet | | 269
Equipment | Glove | | 26
Equipment | Soul Badge | | 927
Equipment | Mystic Badge | | 739
Equipment | Talisman | | 29
Total | | |
Table 2: Number of edges between level-1 labels.
 | Account | Character | Dungeon | Equipment
Account | 229,338 | 154,550 | 0 | 0
Character | 154,550 | 0 | 2,680,520 | 4,821,079
Dungeon | 0 | 2,680,520 | 0 | 0
Equipment | 0 | 4,821,079 | 0 | 0
Total | 7,885,487 | | |
2.2.2 Graph
An entity, which is either an account, character, dungeon, or equipment, becomes a node in the graph. A relationship between two entities forms an edge. There are four types of relationships: 1) (account - account), 2) (account - character), 3) (character - dungeon), and 4) (character - equipment), as shown in Fig. 3. Table 2 reports the number of edges between level-1 labels; in total, the graph contains 7,885,487 edges.
Fig. 4 shows the characteristics of the graph. The degree distributions of nodes for each level-1 label type are shown in Fig. 4(a). Note that the degree distributions of all node types except character follow a power law. The degree distribution of character nodes has a mode due to the following reasons: 1) the number of items a character can have is limited, and 2) the number of dungeons a character can visit is limited. Fig. 4(b) shows the degree distribution of all nodes. Even though there is a mode, the graph shows a skewed degree distribution, which is also verified in Fig. 4(c) showing the adjacency matrix reordered by the SlashBurn [4, 5] method. Note that a reordered adjacency matrix with a thin arrow shape like Fig. 4(c) means there are a few high-degree nodes in the graph which can be used for decomposing (or 'shattering') it [4].
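As a minimal sketch, the per-type degree distributions of Fig. 4(a) can be computed from an edge list and a node-to-label mapping as follows; the toy `edges` and `labels` below are illustrative placeholders, not the actual dataset.

```python
from collections import Counter, defaultdict

# Illustrative inputs: an undirected edge list and a level-1 label for every node id.
edges = [("a0", "c1"), ("c1", "d7"), ("c1", "e3"), ("a0", "a5")]
labels = {"a0": "account", "a5": "account", "c1": "character",
          "d7": "dungeon", "e3": "equipment"}

# Degree of every node.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution (degree -> number of nodes) per level-1 label, as in Fig. 4(a);
# a straight line in a log-log plot indicates a power law.
dist_by_type = defaultdict(Counter)
for node, d in degree.items():
    dist_by_type[labels[node]][d] += 1

for node_type, dist in sorted(dist_by_type.items()):
    print(node_type, sorted(dist.items()))
```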
3 Proposed Method: MDL Formulation
3.1 Overview
Our goal is to construct a concise summary of an MMORPG graph with hierarchical node labels. There are several challenges for the goal.
How can we summarize an MMORPG graph with a few key subgraphs (vocabularies)?
In designing an optimization objective for the summary, how can we incorporate hierarchical node labels?
Given a subgraph, how can we extract consistent substructures at different levels of hierarchy?
Our ideas to solve the challenges are as follows.
Exploiting Minimum Description Length (MDL) allows us to extract key structures (e.g., clique, star, bipartite core, etc.) from the MMORPG graph. We add the formulation for hierarchical labels while exploiting the formulation for structures used in VoG [1].
Considering label consistency of structures enables us to evaluate various structures in depth. We carefully formulate label consistency for each key structure.
Segmenting labels of an inconsistent structure from higher to lower levels generates consistent substructures. By determining segmentation at each level of the hierarchy, we extract pivotal structures from a heterogeneous graph.
In the following sections, we present the problem formulation for heterogeneous graph summarization based on the MDL, and how to design an optimization objective for the summary. We describe how to encode the connectivity of each structure, hierarchical labels, and errors. Then, we describe how to generate a summary of the graph based on the formulation. We describe subgraph generation, subgraph identification based on the encoding cost of structures, structure segmentation for label consistent structures by considering hierarchical labels, and model construction approaches. Table 3 shows the symbols used.
Table 3: Table of symbols.
Symbol | Description
G = (V, E) | Graph with node set V and edge set E
Ω | Set of structure types
st, ch | Star and chain, resp.
fc, nc | Full and near clique, resp.
bc, nb | Full and near bipartite core, resp.
M | Model
𝓜 | Set of all possible models
A | Adjacency matrix of G
s | Structure
area(s) | Edges of G included in a structure s
M | Approximate adjacency matrix of G induced by the model M
E | Error matrix, E = M ⊕ A
⊕ | Exclusive OR
L(G, M) | # of bits to encode G using M
L(M) | # of bits to encode the model M
L(s) | # of bits to encode a structure s
L_label(s) | # of bits to encode the node labels in a structure s
n_s | # of nodes in a structure
h | # of levels
E+, E− | Edges that were over-modeled and under-modeled, resp.
L(E+), L(E−) | Error encoding cost for E+ and E−, resp.
L(E_L) | Encoding cost for the labeling error
3.2 MDL for Heterogeneous Graph Summarization with Hierarchical Labels
Minimum Description Length (MDL) [6] is a model selection method where the best model is the one that gives the best lossless compression:

M* = argmin_{M ∈ 𝓜} L(M) + L(D | M)

where M is a model, 𝓜 is the set of all possible models, D is the given data, L(M) is the length in bits of the description of M, and L(D | M) is the length in bits of the description of D when encoded using M.
Our target model M is an ordered list of graph structures, each having a type from Ω [1]. Each structure corresponds to a certain region of the adjacency matrix, describing its nodes' labels and interconnectivity; different structures may overlap in nodes, but there is no edge overlap. Each node has a hierarchical label (e.g., Table 1).
Following the MDL principle, our method for transmitting the adjacency matrix A is the following. We 1) transmit the model M, 2) induce the approximate adjacency matrix M of A by filling in the connectivity of each structure in M, 3) transmit the error E, which is derived by taking the exclusive OR between M and A, since MDL requires lossless encoding, and 4) transmit the labeling error E_L, which is the label information for nodes not included in the model M. We opt for a model that minimizes the encoding length of the model M, the error E, and the labeling error E_L. Our problem definition is as follows.
Problem 1 (Minimum Hierarchical Graph Description)
Given a heterogeneous graph G with hierarchical labels on its nodes, its adjacency matrix A, and the set of structure types Ω, find the model M that minimizes the total encoding length

L(G, M) = L(M) + L(E) + L(E_L)

where E = M ⊕ A is the error matrix, M is the approximate adjacency matrix of G induced by the model M, and E_L is the labeling error.
3.3 Encoding the Model
The full encoding length of a model M is the following:

L(M) = L_N(|M| + 1) + log C(|M| + |Ω| − 1, |Ω| − 1) + Σ_{s∈M} [ −log Pr(x(s) | M) + L(s) + L_label(s) ]   (1)

where all logarithms are base 2. We encode 1) the total number of structures in M using L_N, Rissanen's optimal encoding for integers greater than zero [7], 2) the number of structures of each type using an index over a weak number composition of |M| into the |Ω| structure types, 3) the type x(s) of each structure s using an optimal prefix code, 4) its connectivity L(s), and 5) the hierarchical labels L_label(s) of its nodes. The detailed encodings for connectivity and hierarchical labels are described in the following two sections, respectively.
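The model-level terms of Equation (1) can be computed with a minimal Python sketch such as the following, assuming the VoG-style encoding described above; the function names and the normalization constant used in L_N are our own choices.

```python
import math
from collections import Counter

def universal_integer_code(n: int) -> float:
    # L_N(n): Rissanen's universal code length for an integer n >= 1, in bits.
    cost, x = math.log2(2.865064), float(n)
    while x > 1.0:
        x = math.log2(x)
        cost += x
    return cost

def log2_binom(n: int, k: int) -> float:
    # log2 of the binomial coefficient C(n, k), used for the weak-composition index.
    return math.log2(math.comb(n, k))

def model_level_cost(structure_types, vocabulary_size):
    # Cost of sending |M|, the per-type structure counts (a weak composition of |M|
    # into |Omega| parts), and each structure's type with an optimal prefix code.
    # The per-structure costs L(s) and L_label(s) are added elsewhere.
    m = len(structure_types)
    cost = universal_integer_code(m + 1)
    cost += log2_binom(m + vocabulary_size - 1, vocabulary_size - 1)
    counts = Counter(structure_types)
    for t in structure_types:
        cost += -math.log2(counts[t] / m)   # optimal prefix code length for type t
    return cost

print(model_level_cost(["st", "st", "fc", "ch"], vocabulary_size=6))
```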
3.4 Encoding Connectivity
We describe how to encode the connectivity of each structure s and derive its cost L(s).
Stars. A star has a distinct characteristic of having a single "hub" node surrounded by two or more "spoke" nodes. We compute L(st) of a star st as follows, where |st| is the number of nodes inside the star and n is the number of nodes in the given graph G:

L(st) = L_N(|st| − 1) + log n + log C(n − 1, |st| − 1)

We encode 1) the number of spokes in the star, 2) the index of the hub node over all n nodes, and 3) which nodes are the star's spokes (spoke ids).
Cliques. A full clique is a subset of vertices where every two distinct vertices are adjacent. We compute L(fc) of a full clique fc as follows, where |fc| is the number of nodes inside the full clique:

L(fc) = L_N(|fc|) + log C(n, |fc|)

We encode 1) the number of nodes in the full clique, and 2) the index of the combination selecting |fc| nodes out of the n nodes.
Near cliques. A near clique has several edges missing from a full clique. We compute L(nc) of a near clique nc as follows:

L(nc) = L_N(|nc|) + log C(n, |nc|) + log(|area(nc)|) + ||nc|| l₁ + ||nc||′ l₀

where |nc| is the number of nodes inside nc, and |area(nc)| is the number of edges in area(nc), the region of the adjacency matrix covered by nc. We use ||nc|| and ||nc||′, respectively, for the number of observable and non-observable edges in area(nc); l₁ and l₀ represent the lengths of the optimal prefix codes for observable and non-observable edges, respectively.

We encode 1) the number of nodes in the near clique, 2) the index of the combination selecting those nodes out of the n nodes, 3) the number of edges, 4) the observable edges using the optimal prefix code, and 5) the non-observable edges.
Bipartite cores. A bipartite core consists of two node sets A and B where edges exist only between the sets and not within them. A full bipartite core is a fully connected bipartite core. We compute L(bc) of a full bipartite core bc as follows:

L(bc) = L_N(|A|) + L_N(|B|) + log C(n, |A|) + log C(n, |B|)

where we encode 1) the numbers of nodes in the node sets A and B, and 2) the ids of the nodes in A and B.
Near bipartite cores. A near bipartite core has several edges missing from a full bipartite core. We compute L(nb) of a near bipartite core nb similarly to a near clique:

L(nb) = L_N(|A|) + L_N(|B|) + log C(n, |A|) + log C(n, |B|) + log(|area(nb)|) + ||nb|| l₁ + ||nb||′ l₀
Chains. A chain is a sequence of nodes where each node is connected to the previous and the next node. We compute L(ch) of a chain ch as follows, where |ch| is the number of nodes in ch:

L(ch) = L_N(|ch| − 1) + Σ_{i=0}^{|ch|−1} log(n − i)

We encode 1) the number of nodes in the chain, and 2) their ordered connectivity (the node ids one after another).
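A minimal sketch of the connectivity costs for stars, full cliques, and chains, assuming the formulas above; near cliques and bipartite cores follow the same pattern with the additional prefix-code terms.

```python
import math

def L_N(n):
    # Rissanen's universal code length for an integer n >= 1 (see the sketch above).
    cost, x = math.log2(2.865064), float(n)
    while x > 1.0:
        x = math.log2(x)
        cost += x
    return cost

def log2_binom(n, k):
    return math.log2(math.comb(n, k))

def star_cost(size, n):
    # L(st): number of spokes, the hub id among all n nodes, and which nodes are spokes.
    return L_N(size - 1) + math.log2(n) + log2_binom(n - 1, size - 1)

def full_clique_cost(size, n):
    # L(fc): number of member nodes and their selection out of the n graph nodes.
    return L_N(size) + log2_binom(n, size)

def chain_cost(size, n):
    # L(ch): number of nodes and their ordered connectivity, one node id at a time.
    return L_N(size - 1) + sum(math.log2(n - i) for i in range(size))

n = 1_000_000   # nodes in the whole graph
print(star_cost(10, n), full_clique_cost(10, n), chain_cost(10, n))
```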




3.5 Encoding Hierarchical Labels
We describe how to encode the hierarchical labels of the nodes in a structure and derive its cost L_label(s). Each node in our target graph has a hierarchical label: e.g., a character node may have the three-level hierarchical label "character-tanker-blade master". Furthermore, different nodes may have labels of different depths: e.g., a dungeon node may have only two levels of hierarchy, like "dungeon-advanced". When transmitting a graph, we assume that the recipient also knows how the hierarchical labels are structured. We start by formalizing the label encoding cost for node labels at a single level l, and then describe the full label encoding cost for hierarchical labels.
Base encoding. At level l, the base label encoding cost of a structure s is given by

L_base,l(s) = Σ_{i=1}^{n_s} log c_{l,i}

where n_s is the number of nodes in the structure s, and c_{l,i} is the number of unique sibling labels for the level-l label of the i-th node of the structure; sibling labels of a label are those with the same upper-level label (e.g., dealer, tanker, and buffer are sibling labels as they branch from character). For example, c_{1,i}, c_{2,i}, and c_{3,i} are 4, 3, and 5, respectively, when the labels of the i-th node are character-dealer-destroyer (see Table 1).
Label consistency. We define the label encoding costs for two special cases: 1) consistent structures and 2) role-consistent structures; both allow us to reduce the label encoding cost. A consistent structure is one that contains only nodes with the same label, as shown in Fig. 5(a). A role-consistent structure is one where nodes with the same role have the same label, as shown in Fig. 5(b). We define four roles in this work: 1) the hub of a star, 2) a spoke (neighbor of the hub) of a star, 3) a node belonging to the first set of a bipartite core, and 4) a node belonging to the second set of a bipartite core. An inconsistent structure is one that is neither consistent nor role-consistent, as shown in Fig. 5(c).
The label encoding cost for a consistent structure at level l is given by

L_cons,l(s) = log c_l

where c_l is the number of unique sibling labels for the shared level-l label of the consistent structure. For a consistent structure, substituting log c_l for Σ_{i=1}^{n_s} log c_{l,i} reduces the label encoding cost.
Next, the label encoding cost for a role-consistent structure at level l is given by

L_role,l(s) = log c_l^(1) + log c_l^(2)

where c_l^(1) and c_l^(2) are the numbers of unique sibling labels for the level-l labels of the two roles that a role-consistent structure can have. For a role-consistent structure, substituting log c_l^(1) + log c_l^(2) for Σ_{i=1}^{n_s} log c_{l,i} reduces the label encoding cost.
Hierarchical label. We define the full encoding cost for the hierarchical labels of a structure s as follows:

L_label(s) = log C(n_s + u_1 − 1, u_1 − 1) + Σ_{l=1}^{l_c} L_cons,l(s) + Σ_{l=l_c+1}^{l_r} L_role,l(s) + Σ_{l=l_r+1}^{h} L_base,l(s) + 2 log(h + 1)

where l_c and l_r mark the levels at which the consistent and role-consistent cases end, respectively; l_c and l_r are set to 0 when the consistent and role-consistent cases do not exist, respectively. u_1 is the number of unique labels at level 1, and h is the number of levels. We encode 1) the number of nodes with each label at level 1 using an index over a weak number composition, 2) the consistent cases' labels from level 1 to level l_c, 3) the role-consistent cases' labels from level l_c+1 to level l_r, 4) the remaining levels using the base encoding for the inconsistent case, and 5) l_c and l_r. We consider the cases in the order of consistent, role-consistent, and inconsistent when moving from higher (e.g., level 1) to lower (e.g., level 3) levels.
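The label encoding costs can be sketched as follows, assuming a partial hierarchy taken from Table 1; the `HIERARCHY` dictionary, the function names, and the exact cost expressions reflect our reading of the base and consistent encodings above.

```python
import math

# Partial hierarchy from Table 1: a node carries a label path such as
# ("character", "dealer", "destroyer"); sibling labels share the same parent.
HIERARCHY = {
    (): ["account", "character", "dungeon", "equipment"],
    ("character",): ["dealer", "tanker", "buffer"],
    ("character", "dealer"): ["force master", "destroyer", "summoner",
                              "blade dancer", "zen archer"],
}

def sibling_count(path, level):
    # c_{l,i}: how many labels share the parent of the level-l label of this path.
    parent = path[:level - 1]
    return len(HIERARCHY.get(parent, [path[level - 1]]))

def base_label_cost(paths, level):
    # Base encoding at level l: every node pays log2(#sibling labels) bits.
    return sum(math.log2(sibling_count(p, level)) for p in paths if len(p) >= level)

def consistent_label_cost(paths, level):
    # Consistent structure: all nodes share the level-l label, so it is sent once.
    return math.log2(sibling_count(paths[0], level))

nodes = [("character", "dealer", "destroyer")] * 5
print(base_label_cost(nodes, 3), consistent_label_cost(nodes, 3))   # 5*log2(5) vs log2(5)
```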
3.6 Encoding Errors
Given the summary of G, it is vital to encode the edges that are under- or over-modeled by the model M, which we refer to as errors. Specifically, there are two types of errors to consider. The first type of error occurs in connectivity; if M is not identical to the corresponding patch of A, we encode the relevant errors. The other type arises from labeling: nodes that are not included in any structure lack label information.
3.6.1 Encoding Connectivity Errors
The encoding length of the connectivity error E of a model M is the following:

L(E) = L(E+) + L(E−)

E+ and E− are the edges that are over-modeled and under-modeled, respectively; L(E+) and L(E−) are the error encoding costs for E+ and E−, respectively. We encode E+ and E− separately as they are likely to have different distributions. Note that we ignore the errors from near cliques and near bipartite cores as they have been encoded exactly. L(E+) and L(E−) are defined as follows:

L(E+) = log(|E+|) + ||E+|| l₁ + ||E+||′ l₀,   L(E−) = log(|E−|) + ||E−|| l₁ + ||E−||′ l₀

We first encode the number of 1s in E+ or E−, and then the 1s and 0s using the optimal prefix code.
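A minimal sketch of the error encoding, assuming the optimal-prefix-code scheme above; `prefix_code_lengths` and `error_cost` are illustrative names.

```python
import math

def prefix_code_lengths(n_ones, n_zeros):
    # Optimal prefix code lengths l1 and l0 for the 1s and 0s of an error matrix.
    total = n_ones + n_zeros
    l1 = -math.log2(n_ones / total) if n_ones else 0.0
    l0 = -math.log2(n_zeros / total) if n_zeros else 0.0
    return l1, l0

def error_cost(n_ones, n_zeros):
    # L(E+) or L(E-): send the number of 1s, then the 1s and 0s with the prefix code.
    l1, l0 = prefix_code_lengths(n_ones, n_zeros)
    return math.log2(n_ones + 1) + n_ones * l1 + n_zeros * l0

# E+ (over-modeled) and E- (under-modeled) edges are encoded separately.
print(error_cost(120, 10_000) + error_cost(45, 10_000))
```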
3.6.2 Encoding Labeling Errors
We discuss encoding the labeling errors, which occur for nodes that the model fails to cover and thus lack label information. Our idea is to collect all such nodes into one virtual structure and apply the base encoding for the label encoding cost. The labeling error encoding cost is the following:

L(E_L) = log C(n_err + u_1 − 1, u_1 − 1) + Σ_{l=1}^{h} Σ_{i=1}^{n_err} log c_{l,i}

where n_err is the number of nodes in the labeling error node set E_L, u_1 is the number of unique labels at level 1, and c_{l,i} is the number of unique sibling labels for the level-l label of the i-th node.
4 Proposed Method: Summary Generation
We present GSHL (Graph Summarization with Hierarchical Labels), our proposed summarization algorithm for heterogeneous graphs with hierarchical node labels. Algorithm 1 describes GSHL.
4.1 Subgraph Generation
We first decompose a given large graph into several subgraphs using a graph decomposition algorithm. We leverage SlashBurn [4] to generate subgraphs; its cost depends on the number of edges, the number of nodes, and the number of SlashBurn iterations. SlashBurn has been used for many graph tasks [8, 9, 10], and [1, 3] demonstrate that it successfully decomposes real-world graphs for graph summarization. However, other algorithms such as SUBDUE [11] and BIGCLAM [12] can also be used.
4.2 Subgraph Identification
For each generated subgraph, we find the structure type in Ω that minimizes the encoding cost. For each subgraph, we 1) compute the encoding cost for all structure types, 2) compare the encoding costs, and 3) encode the subgraph as the structure type with the minimum encoding cost. We assume that each subgraph corresponds to one structure among the six types.
We determine the role of each node in a given subgraph to encode the subgraph as one of the structure types. For example, which node is the hub of a star? Which nodes belong to set A or set B of a bipartite core? How do we determine the order of nodes in a chain? We determine the roles as follows.
Stars. We encode the highest-degree node of the subgraph as the hub, and the remaining nodes as the spokes.
Cliques. All nodes have the same structural role in a clique.
Bipartite cores. We determine which nodes are included in set A or set B using binary classification for bipartite cores [1, 3]. We add the highest-degree node to set A and its neighbors to set B. Then, we perform Fast Belief Propagation (FaBP) [13] to propagate classes and add nodes to set A or B.
Chains. We adopt a heuristic approach [1, 3] to encode a subgraph as a chain. We 1) randomly pick a node and find a farthest node using breadth-first search (BFS), 2) set the discovered node as the starting node v_s, 3) find a farthest node from v_s and set it as the end node v_e, and 4) designate the shortest path from v_s to v_e as the initial chain. Additionally, we extend the initial chain: we construct an induced subgraph by removing all nodes included in the initial chain except for v_e, and extend the chain by concatenating the shortest path from v_e to a farthest node in the induced subgraph. We repeat the same process for v_s.
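A minimal sketch of the double-BFS chain heuristic (without the extension step), assuming the subgraph is given as an adjacency list; the function names are ours.

```python
from collections import deque

def bfs_farthest(adj, start):
    # Return the node farthest from `start` (in hops) together with BFS parent pointers.
    dist, parent = {start: 0}, {start: None}
    queue, far = deque([start]), start
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], parent[v] = dist[u] + 1, u
                if dist[v] > dist[far]:
                    far = v
                queue.append(v)
    return far, parent

def initial_chain(adj, seed):
    # Double-BFS heuristic: the farthest node from a random seed becomes the start,
    # the farthest node from the start becomes the end, and the shortest path between
    # them is the initial chain (the extension step described above is omitted here).
    start, _ = bfs_farthest(adj, seed)
    end, parent = bfs_farthest(adj, start)
    path, cur = [], end
    while cur is not None:
        path.append(cur)
        cur = parent[cur]
    return list(reversed(path))

adj = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
print(initial_chain(adj, seed=4))   # e.g. [1, 2, 3, 4]
```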
After obtaining the candidate structures from a subgraph, we compute and compare their encoding costs. Given a subgraph with adjacency matrix A_sub, we compute the local cost L(A_sub, M_s) [1, 3] to encode it as a structure s, where M_s is the model consisting only of the subgraph encoded as s, and E_s = M_s ⊕ A_sub is the error matrix derived from M_s, the approximate adjacency matrix induced by M_s. Note that the labeling errors are ignored, since the nodes in a structure are connected and thus there are no uncovered nodes inside the structure. We compute the local cost by adding L(s) and L(E_s) (e.g., L(st) + L(E_st) for a star); the remaining terms of Equation (1) are negligible since M_s consists only of the structure s. L(E_s) equals L(E_s+) + L(E_s−), where L(E_s+) and L(E_s−) are the error encoding costs for E_s+ and E_s−, the edges that are over-modeled and under-modeled in the given subgraph, respectively.
We add the best structure, i.e., the one minimizing the local encoding cost, to the set of candidate structures.
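Conceptually, subgraph identification is a minimum over per-type local costs, as in the toy sketch below; the simplified costs here only stand in for the full L(s) + L(E_s) terms and are not the exact formulas of Section 3.

```python
import math

def star_local_cost(adj):
    # Encode the subgraph as a star: the highest-degree node becomes the hub and the
    # rest become spokes; edges not touching the hub, and missing hub-spoke edges,
    # are counted as error (simplified toy cost, in bits).
    n = len(adj)
    hub = max(adj, key=lambda u: len(adj[u]))
    modeled = len(adj[hub])
    total_edges = sum(len(vs) for vs in adj.values()) // 2
    error_edges = (total_edges - modeled) + (n - 1 - modeled)
    return math.log2(n) + (n - 1) + 2.0 * error_edges

def full_clique_cost(adj):
    # Encode the subgraph as a full clique: every missing pair is an error (toy cost).
    n = len(adj)
    total_edges = sum(len(vs) for vs in adj.values()) // 2
    missing = n * (n - 1) // 2 - total_edges
    return n + 2.0 * missing

def identify(adj):
    # Keep the structure type with the minimum local encoding cost.
    candidates = {"star": star_local_cost(adj), "full clique": full_clique_cost(adj)}
    return min(candidates.items(), key=lambda kv: kv[1])

adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}   # a 4-node star
print(identify(adj))                            # -> ('star', 5.0)
```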


4.3 Structure Segmentation
We describe the segmentation of a structure considering the hierarchical labels of its nodes. The objective is to further reduce the label encoding cost of a structure by exploiting label consistency. Fig. 6 shows an example comparing a large inconsistent structure with small consistent structures. The inconsistent star on the left with 9 nodes has a low structure encoding cost but a high label encoding cost. In contrast, the two decomposed stars on the right have a higher total structure encoding cost but a lower total label encoding cost. We compare the total encoding costs before and after segmentation, and segment only when doing so leads to the minimum cost. We describe how to segment each structure, and how to extend the segmentation idea to hierarchical labels, in the following two subsections, respectively.
4.3.1 Label Segmentation
We describe the segmentation of a structure at a single level. We divide a structure into smaller structures of the same type only when the smaller structures yield a lower encoding cost.
Stars. We divide spokes of a star into two groups considering the consistency of labels, and construct two stars which have the same hub, but different spokes. We assign spoke nodes of the majority label to the first group and the remaining spoke nodes to the second group.
Bipartite cores. Bipartite cores have two sets of nodes. For each set, we search for the majority label in the set and divide the nodes of the set into those with the majority label and the rest. As shown in Fig. 7, we consider four cases depending on the segmentation of each set: 1) no segmentation, 2) segmenting only the first set, 3) segmenting only the second set, and 4) segmenting both sets. We choose the option with the minimum encoding cost among the four cases.
Cliques and chains. Contrary to star and bipartite core structures, all the nodes in cliques and chains have the same role. We first find the majority label, and then divide the structure into nodes of the majority label and the rest. We segment the given structure when the sum of the encoding costs of the two segmented structures is lower than that of the given structure.
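A minimal sketch of the majority-label split for a star's spokes; the cost arguments of `should_segment` stand for the structure-plus-label costs defined in Section 3.

```python
from collections import Counter

def segment_spokes_by_label(spoke_labels):
    # Split the spokes of a star into the majority-label group and the rest,
    # following the star segmentation rule above.
    majority, _ = Counter(spoke_labels).most_common(1)[0]
    return ([s for s in spoke_labels if s == majority],
            [s for s in spoke_labels if s != majority])

def should_segment(cost_before, cost_a, cost_b):
    # Keep the segmentation only if the total encoding cost decreases.
    return cost_a + cost_b < cost_before

spokes = ["zen archer"] * 6 + ["destroyer", "warden", "warlock"]
print(segment_spokes_by_label(spokes))
```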
4.3.2 Hierarchical Consideration
Given a structure with hierarchical labels, we segment the structure from higher to lower levels. For each level, we perform label segmentation to determine whether the structure should be segmented or not. Hierarchical label segmentation follows the following rules.
If a structure is consistent or role-consistent at the current level, the structure is challenged for label segmentation at the next lower level.
If an inconsistent structure is segmented and the current level is not the lowest, the segmented structures are independently challenged for segmentation at the next lower level.
If an inconsistent structure is not segmented, we terminate the procedure and insert the structure in the set of the candidate structures.
At the lowest level, we compare the encoding costs and insert the structure or the segmented structures in the set of the candidate structures.
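The level-wise procedure can be sketched as a recursion, assuming callbacks for the consistency test and for label segmentation; the toy demo at the bottom ignores the cost comparison and is purely illustrative.

```python
from collections import Counter

def hierarchical_segment(structure, level, max_level, is_consistent, try_segment):
    # Level-wise segmentation sketch: `is_consistent(structure, level)` and
    # `try_segment(structure, level)` stand in for the consistency test and the
    # label segmentation of Section 4.3.1; both callback names are ours.
    if level > max_level:
        return [structure]
    if is_consistent(structure, level):        # consistent (or role-consistent) case
        return hierarchical_segment(structure, level + 1, max_level,
                                    is_consistent, try_segment)
    parts = try_segment(structure, level)      # inconsistent case: try to split
    if parts is None:                          # splitting does not pay off: stop here
        return [structure]
    result = []
    for part in parts:                         # each part is challenged at the next level
        result += hierarchical_segment(part, level + 1, max_level,
                                       is_consistent, try_segment)
    return result

# Toy demo: a "structure" is a list of label paths, consistency means sharing the
# level-l label, and the split ignores the encoding-cost comparison of the real method.
def is_consistent(struct, level):
    return len({path[level - 1] for path in struct}) == 1

def try_segment(struct, level):
    majority, count = Counter(path[level - 1] for path in struct).most_common(1)[0]
    if count == len(struct):
        return None
    return ([p for p in struct if p[level - 1] == majority],
            [p for p in struct if p[level - 1] != majority])

nodes = [("character", "dealer"), ("character", "dealer"), ("character", "tanker")]
print(hierarchical_segment(nodes, 1, 2, is_consistent, try_segment))
```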
4.4 Model Summary
We construct the final model which summarizes the given heterogeneous graph by carefully selecting structures among the candidate structures . Finding the best model exactly by exploring all possible subsets of candidate structures is intractable. Thus we propose the following heuristics to choose a subset of candidate structures in a scalable way.
• Vanilla: All candidate structures are included in our summary; thus the model equals the set of candidate structures.
• Top-k: We pick the top-k structures by sorting the local encoding gains of the candidate structures in descending order. The local encoding gain of a structure is the number of bits saved by encoding its subgraph as that structure.
• Benefit: We pick all structures whose local encoding gains are greater than zero.
GSHL uses the above approaches to finalize the model.
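A minimal sketch of the three selection heuristics, assuming the local encoding gains of the candidate structures have already been computed.

```python
def vanilla(candidates):
    # Vanilla: keep every candidate structure.
    return list(candidates)

def top_k(candidates, gains, k):
    # Top-k: keep the k structures with the largest local encoding gains.
    return sorted(candidates, key=lambda s: gains[s], reverse=True)[:k]

def benefit(candidates, gains):
    # Benefit: keep every structure whose local encoding gain is positive.
    return [s for s in candidates if gains[s] > 0]

cands = ["s1", "s2", "s3", "s4"]
gains = {"s1": 120.0, "s2": -3.5, "s3": 41.0, "s4": 0.0}
print(vanilla(cands), top_k(cands, gains, 2), benefit(cands, gains))
```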
Table 4: Summarization results on the Blade & Soul graph.
Method | # of structures | Description cost (bits) | Relative cost | Unexplained edges
Original | | | |
GSHL-Vanilla | | | |
GSHL-Top-100 | | | |
GSHL-Benefit | | | |
5 Experiment
We evaluate GSHL to answer the following questions:
Q1 Graph summarization. How well does GSHL summarize the Blade & Soul graph?
Q2 Discovery. What do we discover by analyzing the Blade & Soul graph with GSHL?
Q3 Finding similar users. How can we find similar users using the summary produced by GSHL?
Q4 Scalability. How well does GSHL scale with the number of edges?
Table 5: Types of structures discovered by each method (st: star, fc: full clique, nc: near clique, bc: full bipartite core, nb: near bipartite core, ch: chain).
Method | st | fc | nc | bc | nb | ch
GSHL-Vanilla | | | | | |
GSHL-Top-100 | | | | | |
GSHL-Benefit | | | | | |
We use the Blade & Soul graph for experiments. We run experiments on a workstation with an Intel Xeon E5-2630 v4 @ 2.2GHz CPU and 512GB memory.
5.1 Graph Summarization (Q1)
We evaluate the description cost and edge coverage of the Blade & Soul graph summarized by GSHL. Our baseline is the Original model, where the original edges are encoded with the connectivity error encoding and the labels are encoded with the labeling error encoding.
Table 4 compares the number of structures, the description cost, and the ratio of unexplained edges for Original and the three proposed models (GSHL-Vanilla, GSHL-Top-100, and GSHL-Benefit). GSHL-Vanilla summarizes the given graph with only a small fraction of the bits required by Original, and explains all but a small portion of the edges (those not modeled by any structure in the summary). GSHL-Top-100 contains only the 100 highest-gain structures and still summarizes the given graph with a fraction of the bits required by Original, at the cost of leaving more edges unexplained than GSHL-Vanilla; each structure in GSHL-Top-100 contains a larger number of nodes and edges than the structures not included. Compared to GSHL-Vanilla, GSHL-Top-100 has fewer structures but a larger cost. GSHL-Benefit shows a reasonable balance between the description cost and the number of structures: compared to GSHL-Vanilla, it has a similar description cost but contains a smaller number of structures.




5.2 Discovery (Q2)
We present the discovery results of Blade & Soul graph using GSHL.
5.2.1 Major Structures
How does GSHL summarize the Blade & Soul graph, and which structures can we discover? Table 5 shows the structures from our proposed methods GSHL-Vanilla, GSHL-Top-100, and GSHL-Benefit. Note that GSHL-Vanilla contains the most diverse structures, including stars, full cliques, near bipartite cores, and chains. GSHL-Top-100 contains only stars, as they are the dominant structures when using SlashBurn for graph decomposition. GSHL-Benefit contains mainly stars, but also several chains, which means that there are clear chain-like structures in the decomposed subgraphs.
5.2.2 Meaningful Structures
What interpretations can we make out of the summary? We report meaningful structures of Blade & Soul graph discovered by GSHL.
Popular dungeons and equipment. We observe that the hub nodes of the star structures in GSHL-Top-100 are mainly popular dungeon and equipment nodes, surrounded only by character nodes. This reflects the nature of Blade & Soul, where characters carry various equipment and cooperate with other characters to visit various dungeons.
Influential account. Fig. 8(a) shows a star structure consisting only of account nodes. The hub node is an influential account connected to other accounts.
Small clan. Fig. 8(b) shows account nodes forming a full clique structure. The accounts on the right possess characters which visited common dungeons. This represents a small clan where the accounts play together frequently.
Near bipartite core of characters and equipment. Fig. 8(c) shows a near bipartite core: a blade master and two wardens, all tankers, form set A, and they are equipped with several soul shields in set B. This indicates that the three characters have the same preference for equipment.
Segmented star. We analyze structures segmented by considering hierarchical labels. Fig. 1 shows a large star structure, whose hub is an equipment node with the soul label, divided into three smaller star structures. The spokes of each star consist of: a) zen archers, b) dealers other than zen archers, and c) all jobs of the tanker and buffer types. This segmentation occurs since zen archers account for a disproportionately large share of the spokes even though they are only one of the 11 jobs. Although any job can use this soul equipment, zen archers prefer it the most.
5.3 Finding Similar Users (Q3)
How can we exploit the result of summarization for finding similar accounts in Blade & Soul? We measure account similarities as follows:
1. Matrix construction. Given the found structures, we first construct a node-structure matrix. The (i, j)-th element of the matrix is set to a value that decreases with d_{ij}, the minimum length of the shortest paths between account i and the nodes in structure j, with a decay parameter fixed in this experiment; if account i is in structure j, d_{ij} is set to 0.
2. Randomized SVD for features. We compute a randomized SVD of the matrix to obtain feature vectors of accounts.
3. Cosine similarity. We compute the cosine similarity between two account feature vectors (see the sketch below).
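A minimal sketch of steps 1-3 using scikit-learn's randomized SVD; the toy matrix `B` and the chosen dimensionality are illustrative assumptions.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative node-structure matrix B for 5 accounts and 3 structures: larger values
# mean the account is closer (in shortest-path distance) to the structure.
B = np.array([[1.0, 0.5, 0.0],
              [1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0]])

# Step 2: randomized SVD; the left singular vectors scaled by the singular values
# serve as low-dimensional account features.
U, Sigma, VT = randomized_svd(B, n_components=2, random_state=0)
features = U * Sigma

# Step 3: cosine similarity between the target account (row 0) and all accounts.
sims = cosine_similarity(features[:1], features)[0]
most_similar = np.argsort(-sims)[1:3]       # skip the account itself
print(most_similar, sims[most_similar])
```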
We pick a target account with id 0, and find the two most similar accounts, which have ids 3602 and 9. We observe that they share common characters, equipment, and often friends. Fig. 2(a) shows the relation between accounts 0 and 3602. Note that they have characters with the same jobs (destroyer and blade master), and the corresponding characters have common equipment and visited the same dungeons; e.g., the two destroyers have identical equipment and visited identical dungeons. Fig. 2(b) shows the relation between accounts 0 and 9. We observe similar patterns of characters and their related equipment and dungeons. They also share 18 common friends.
5.4 Scalability (Q4)
We measure the running time of GSHL while varying the number of edges. We generate four synthetic subgraphs of increasing size using Forest Fire Sampling (FFS) [14]. Note that FFS makes the generated subgraphs follow the properties of real-world graphs: heavy-tailed degree distributions, the densification law, and shrinking effective diameters [14]. As shown in Fig. 9, GSHL scales near-linearly with the number of edges.
6 Related Work
We present related work in four main categories: minimum description length, graph compression and summarization, hierarchical graphs, and game data mining.
Minimum description length. The MDL principle [7] is a model selection method where the best model is the one that gives the best lossless compression. Its use for compression [15] is closely related to summarization and pattern discovery. GSHL exploits the MDL principle to summarize a heterogeneous graph with hierarchical node labels.
Graph compression and summarization. Works based on eigendecomposition [16], modularity-based optimization [17], and cross-association [18] find dense structures such as cliques and bipartite cores. SUBDUE [11] discovers the substructure that best compresses a graph with node labels based on the MDL principle. SlashBurn [4] reorders the nodes of a graph to compress it efficiently by exploiting the characteristics of real-world graphs. [19] summarizes RDF graphs, and [20] performs graph summarization for hashtag recommendation on social graphs. VoG [1] leverages the MDL principle to summarize a homogeneous graph with common structures. Shah et al. [3] propose TimeCrunch, which extends VoG to dynamic graphs. None of the above methods target a heterogeneous MMORPG graph, i.e., a heterogeneous graph with hierarchical labels, which GSHL focuses on.

Hierarchical graph. Hierarchical graphs have been studied for node embedding and classification. To effectively learn node embeddings in a graph, GNN methods use hierarchical structures. HGP-SL [21] and DiffPool [22] improve the performance of predicting unknown graph labels using hierarchical representations of graphs. StructureNet [23] proposes a hierarchical-graph-based generator that produces diverse and realistic mesh geometries. Zhang et al. [24] use hierarchical label information to improve multi-label classification performance. Wendt et al. [25] propose a hierarchical label propagation algorithm for document classification. Unlike GSHL, none of the above methods address the problem of hierarchical graph summarization; our work is the first to summarize a heterogeneous graph with hierarchical labels.
Game data mining. There have been several works analyzing game data. Drachen et al. [26] analyze the behaviors of players in MMORPGs using a clustering approach. Bernardi et al. [27] propose a bot detection algorithm by analyzing the behaviors of players. Yang et al. [28] cluster game players by analyzing their purchase records. Thompson et al. [29] analyze chat messages of players in StarCraft 2 using a lexicon-based approach. In contrast, we represent data from a popular MMORPG as a heterogeneous graph and analyze the graph through summarization.
7 Conclusion
In this paper, we analyze a massive real-world MMORPG graph using hierarchical graph summarization. For this purpose, we propose GSHL, a novel graph summarization algorithm for heterogeneous graphs with hierarchical node labels. Based on the MDL principle, GSHL decomposes a heterogeneous graph into subgraphs, identifies each subgraph as a structure, segments each structure by considering the hierarchical labels of its nodes, and summarizes the graph. Through extensive experiments on a large real-world MMORPG graph, we show that GSHL is a useful and scalable tool for graph summarization. Using GSHL, we find key structures and similar users in the MMORPG graph.
Acknowledgement
This research is supported by NCSOFT Co.
References
- [1] D. Koutra, U. Kang, J. Vreeken, C. Faloutsos, VOG: summarizing and understanding large graphs, in: SDM, 2014, pp. 91–99.
- [2] D. Koutra, U. Kang, J. Vreeken, C. Faloutsos, Summarizing and understanding large graphs, Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (3) (2015) 183–202.
- [3] N. Shah, D. Koutra, T. Zou, B. Gallagher, C. Faloutsos, Timecrunch: Interpretable dynamic graph summarization, in: SIGKDD, 2015, pp. 1055–1064.
- [4] U. Kang, C. Faloutsos, Beyond ’caveman communities’: Hubs and spokes for graph compression and mining, in: ICDM, 2011, pp. 300–309.
- [5] Y. Lim, U. Kang, C. Faloutsos, Slashburn: Graph compression and mining beyond caveman communities, IEEE Trans. Knowl. Data Eng. 26 (12) (2014) 3077–3089.
- [6] J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465 – 471.
- [7] J. Rissanen, A universal prior for integers and estimation by minimum description length, The Annals of statistics (1983) 416–431.
- [8] K. Shin, J. Jung, L. Sael, U. Kang, BEAR: block elimination approach for random walk with restart on large graphs, in: SIGMOD, 2015, pp. 1571–1585.
- [9] J. Jung, K. Shin, L. Sael, U. Kang, Random walk with restart on large graphs using block elimination, ACM Trans. Database Syst. 41 (2) (2016) 12.
- [10] J. Jung, N. Park, L. Sael, U. Kang, Bepi: Fast and memory-efficient method for billion-scale random walk with restart, in: SIGMOD, 2017, pp. 789–804.
- [11] D. J. Cook, L. B. Holder, Substructure discovery using minimum description length and background knowledge, J. Artif. Intell. Res. 1 (1994) 231–255.
- [12] J. Yang, J. Leskovec, Overlapping community detection at scale: a nonnegative matrix factorization approach, in: WSDM, 2013, pp. 587–596.
- [13] D. Koutra, T. Ke, U. Kang, D. H. Chau, H. K. Pao, C. Faloutsos, Unifying guilt-by-association approaches: Theorems and fast algorithms, in: ECML-PKDD, 2011, pp. 245–260.
- [14] J. Leskovec, J. M. Kleinberg, C. Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, in: SIGKDD, 2005, pp. 177–187.
- [15] C. Faloutsos, V. Megalooikonomou, On data mining, compression, and kolmogorov complexity, Data mining and knowledge discovery 15 (1) (2007) 3–20.
- [16] N. Shah, A. Beutel, B. Gallagher, C. Faloutsos, Spotting suspicious link behavior with fbox: An adversarial perspective, in: ICDM, 2014, pp. 959–964.
- [17] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (10) (2008) P10008.
- [18] D. Chakrabarti, S. Papadimitriou, D. S. Modha, C. Faloutsos, Fully automatic cross-associations, in: SIGKDD, 2004, pp. 79–88.
- [19] Š. Čebirić, F. Goasdoué, P. Guzewicz, I. Manolescu, Compact summaries of rich heterogeneous graphs.
- [20] M. Al-Dhelaan, H. Alhawasi, Graph summarization for hashtag recommendation, in: 3rd International Conference on Future Internet of Things and Cloud, FiCloud 2015, Rome, Italy, August 24-26, 2015, IEEE Computer Society, 2015, pp. 698–702.
- [21] Z. Zhang, J. Bu, M. Ester, J. Zhang, C. Yao, Z. Yu, C. Wang, Hierarchical graph pooling with structure learning, arXiv preprint arXiv:1911.05954.
- [22] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, J. Leskovec, Hierarchical graph representation learning with differentiable pooling, in: Advances in Neural Information Processing Systems 31, 2018, pp. 4800–4810.
- [23] K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka, N. Mitra, L. J. Guibas, Structurenet: Hierarchical graph networks for 3d shape generation, arXiv preprint arXiv:1908.00575.
- [24] L. Zhang, S. K. Shah, I. A. Kakadiaris, Hierarchical multi-label classification using fully associative ensemble learning, Pattern Recognition 70 (2017) 89–103.
- [25] J. B. Wendt, M. Bendersky, L. G. Pueyo, V. Josifovski, B. Miklos, I. Krka, A. Saikia, J. Yang, M. Cartright, S. Ravi, Hierarchical label propagation and discovery for machine generated email, in: WSDM, 2016, pp. 317–326.
- [26] A. Drachen, R. Sifa, C. Bauckhage, C. Thurau, Guns, swords and data: Clustering of player behavior in computer games in the wild, in: CIG, 2012, pp. 163–170.
- [27] M. L. Bernardi, M. Cimitile, F. Martinelli, F. Mercaldo, A time series classification approach to game bot detection, in: WIMS, 2017, pp. 6:1–6:11.
- [28] W. Yang, G. Yang, T. Huang, L. Chen, Y. E. Liu, Whales, dolphins, or minnows? towards the player clustering in free online games based on purchasing behavior via data mining technique, in: Big Data, 2018, pp. 4101–4108.
- [29] J. J. Thompson, B. H. M. Leung, M. R. Blair, M. Taboada, Sentiment analysis of player chat messaging in the video game starcraft 2: Extending a lexicon-based model, Knowl.-Based Syst. 137 (2017) 149–162.