GraMeR:Graph Meta Reinforcement Learning for Multi-Objective Influence Maximization

Sai Munikoti¹, Balasubramaniam Natarajan¹ and Mahantesh Halappanavar² ¹S. Munikoti and B. Natarajan are with Electrical and Computer Engineering, Kansas State University, Manhattan, KS-66506, USA. [email protected], [email protected]²M. Halappanavar is with Data Science and Machine Intelligence group, PNNL, Richland, USA. [email protected]This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract

Influence maximization (IM) is a combinatorial problem of identifying a subset of nodes called the seed nodes in a network (graph), which when activated, provide a maximal spread of influence in the network for a given diffusion model and a budget for seed set size. IM has numerous applications such as viral marketing, epidemic control, sensor placement and other network-related tasks. However, the uses are limited due to the computational complexity of current algorithms. Recently, learning heuristics for IM have been explored to ease the computational burden. However, there are serious limitations in current approaches such as: (1) IM formulations only consider influence via spread and ignore self activation; (2) scalability to large graphs; (3) generalizability across graph families; (4) low computational efficiency with a large running time to identify seed sets for every test network. In this work, we address each of these limitations through a unique approach that involves (1) formulating a generic IM problem as a Markov decision process that handles both intrinsic and influence activations; (2) employing double Q learning to estimate seed nodes; (3) ensuring scalability via sub-graph based representations; and (4) incorporating generalizability via meta-learning across graph families. Extensive experiments are carried out in various standard networks to validate performance of the proposed Graph Meta Reinforcement learning (GraMeR) framework. The results indicate that GraMeR is multiple orders faster and generic than conventional approaches.

Index Terms:

graph neural networks, Q learning, influence maximization

I Introduction

Consider a directed graph $G=(V,E,\omega)$ , where $V$ is a set of vertices, $E$ is a set of edges (pairwise relationships on vertices), and $\omega$ is a set of edge weights; a diffusion model, and a budget $k$ . Influence maximization (IM) is the problem of identifying a set of $k$ seed nodes, which when activated, will result in the activation of a maximal number of nodes in $G$ , for the given diffusion model of influence. IM has applications in various domains ranging from viral marketing in social networks to influential proteins in biological networks. For instance, one of the reasons behind the tremendous success of social media platforms is the quality of content the users create or generate via sharing. These actions can be attributed to influence dynamics in the social network. IM was first introduced in 2001 and was formulated as a combinatorial optimization problem by [1]. Majority of the IM algorithms focus on settings where seed nodes are activated deterministically and then neighborhood nodes are activated via influence. However, there exists two types of activation in real-life scenarios, namely intrinsic and influenced [2]. In case of social media, intrinsic refers to users who post content whereas sharing, retweeting, commenting constitute influenced activation. To this end, the authors in [3] recognize that the events on social media can be categorized as exogenous and endogenous and model the overall diffusion through a multivariate Hawke’s process to address activity shaping in social networks. In another recent work [4], the authors propose a novel diffusion model based on factor graphs and graphical models where the node potentials can correspond to the notion of intrinsic activation. However, the focus of their work is on the diffusion model itself, not on the aspects of intrinsic activation.

Furthermore, despite the huge potential of IM, its field uses are limited due to the computational inefficiency of conventional greedy algorithms. Therefore, recently, there have been studies that use Reinforcement learning (RL) to learn generalized policy for combinatorial optimization problems on graphs [5, 6, 7] including IM. The key idea is to decompose the node selection process into a sequence, and learn a heuristic policy that selects nodes sequentially. The RL policy is usually trained on a set of seen training graphs, in the hope that it generalizes to unseen test graphs of similar characteristics. To better generalize the trained policy across different graphs, graph embedding techniques, such as Structure to Vector (S2V)[5] and Graph Convolutional Networks (GCNs) [8] are integrated as part of the RL value functions to extract the graph structure information. Primarily proposed to solve relatively simple tasks such as the traveling salesman problem (TSP) and the maximum vertex cover (MVC), some of the recent works [9, 10] extend it to the IM problem. However, they possess following shortcomings: (1) Formulation: Existing works such as GCOMB, S2V-DQN and GCN-TREESEARCH addresses a trivial IM problem without explicitly accounting the intrinsic and influence activations. Thus, these methods could not work effectively in real-world scenarios. (2) Computational complexity: Most of the conventional IM works such as CHANGE and [2] are computationally complex and thus possess limitations in transitioning solution to real-life uses. These methods require re-computation of IM nodes whenever a underlying network changes and usually the stakeholders do not have the HPCs to perform repetitive computation. (3) Scalability: To counter computational complexity of existing IM algorithms, RL based learning algorithms have been recently proposed in the literature. The primary focus of these methods (S2V-DQN and RL4IM) are on obtaining high quality results. Howevever, the efficiency studies are limited to graphs containing only hundreds of thousands of nodes. Real-life graphs may contain millions of nodes/edges. (4) Generalizability: Existing methods such as S2V-DQN, GCOMB and RL4IM does not systematically account for graph variation, i.e., trained RL model would work only on graphs of similar characteristics.

I-A Contributions

To address some of the above-mentioned shortcomings, this article proposes a novel Graph Meta Reinforcement learning (GraMeR) framework for influence maximization problem that affords both computational advantage and scalability to graphs of different sizes and families. Particularly, the article makes the contributions to the literature in the following aspects:

1.

The underlying IM formulation in this work is a more generic and realistic in terms of incorporating intrinsic and influence activation via probabilistic paradigm.
2.

A novel GNN based approach to prune the search space while inspecting for the most influential nodes. This improves the computational efficiency of the proposed framework.
3.

The RL algorithm of our framework, i.e., double deep Q learning learns the topological patterns which are recurring at every step of the Graph optimization problem. Therefore, once the model is trained, it can predict the appropriate sequence for a new graph in no time.
4.

Meta learning is achieved by feeding the input graph information along with the state vector while estimating the Q value (long term benefit) of feasible actions. This induces generalization in our framework by allowing predictions across graphs of different families.
5.

Integration of intrinsic and influence activation requires multi-objective RL formulation. A single policy multi Q learning approach is implemented to learn optimal policy for a given preferences of objectives.

The rest of the paper is organized as follows. Section II offers the background of IM. Section III describes the generic IM formulation followed by proposed candidate node predictor module in Section IV. SectionV presents the main DRL framework with experiments and results in Section VI. Final conclusions are provided in Section VII.

II Background and Related work

[11] and [1] constitute the seminal work that introduced Influence Maximization (IM) and provided a practical approach to solve it. This work was followed by several other works that explored various diffusion models and variations to the one discussed in [1] namely, the independent cascade (IC) model and the linear threshold model. For instance, in [12], the authors consider the task of identifying the individuals whose strong positive opinion about a product will maximize the overall positive opinion about the product. In the process, the authors leverage the social influence model proposed by Friedkin and Johnsen [13]. In recent works [4, 3], the authors recognize several types of activations in social network and propose methods to identify seed nodes under these activations.

Although the IM literature has become mature enough to tackle various real-world network scenarios, there exists a serious limitation attributed to the high computational costs in various applications [14, 15, 16]. Therefore, AI/ML community focused on developing algorithms for such computationally exhaustive tasks ranging from node identification tasks to combinatorial optimization on graphs [17, 18, 19]. The existing literature for combinatorial and graph problems can be broadly divided into two types: First, algorithms that pose combinatorial optimization into a sequential decision-making process and learn heuristics to support the primary optimizer [20, 21]. The application includes graph discovery in social network, restoration routes in critical infrastructure networks among others [22, 23]. Major drawbacks of these approaches are generalizability and data inefficiency. Therefore, the second category of work focuses on novel graph machine learning algorithms to tackle these shortcomings. Particularly, reinforcement learning (RL) can be used for solving graph optimization problems. However, functional approximators in conventional RL algorithms are not powerful enough to learn complex tasks. The advent of deep learning significantly boosts the approximation capability and enables RL to solve several tasks that were not possible before. [24] is one of the pioneering works in deep reinforcement learning (DRL), which combines attention-based function approximator with policy gradient Deep RL algorithms. Further, graph convolutional networks are more appropriate for capturing topological information in graphs. In this regard, [6] approximate solution with graph convolutional networks (GCN) embeddings, and uses a learning framework based on guided tree search to approximate DRL solutions. Moreover, there are some works where RL is combined with GCN to solve the traveling sales man problem [7]. Similarly, authors in [25], solved the widely known Knowledge Graph (KG) completion problem with DRL. Particularly, they employed GCN to learn a representation of KG and then use a deep deterministic policy gradient algorithm to estimate missing relationships.

Specific to IM, Li et al. and Tian et al. are a few efforts to solve the IM problem using deep Q learning. Here, reward corresponds to marginal gain in influence spread, and agent learns to find a node set that has maximum spread. Manchanda extended the method in Khalil by solving IM problem for very large graphs [9]. Specifically, it has a separate module to predict node quality which is fed as an input to the DRL agent. This induces computational burden in the training pipeline and the method of predicting node quality is not systematic. Recently, [10] developed a DRL algorithm for realistic IM by considering node willingness to be seed node. However, it requires the adjacency matrix as an input, and therefore, unscaleable for large graphs. In this paper, we propose a novel formulation for solving IM that addresses some of the challenges that we discussed so far.

III Formulation of Activation informed IM (AIM)

The overall architecture of the proposed framework (GraMeR) is depicted in Fig. 1, and consists of three main modules. Each of these modules is explained in forthcoming sections. In this section, we describe the IM model considered in our work. It is reasonably different and generic relative to typical IM models explored in related recent work. In a social network, users propagate their views or opinions while simultaneously consuming and reacting to content created by friends, people, and organizations they follow. Thus, there are two ways of activation, namely intrinsic (content creation) and spread of influence (content spreading). Conventional IM problem solely considers influence activation and overlooks intrinsic activation. However, in practice, the role of content creators has gained significant importance due to massive digitization. Therefore, in this work, we are considering a generic formulation of IM that incorporates both types of activations and is referred to here as activation-informed influence maximization (AIM).

Enabled by the digital revolution, most users now are both content creators and content spreaders at the same time. This is probabilistically modeled through parameters $p_{s}$ and $p_{f}$ which represent the probability of intrinsic and influence activation, respectively. Intrinsic activation for a user $u$ is based on its own activities, and user $u$ is assumed to be directly influenced by its 1-hop neighbors. Therefore, the influence part of the probability for activation is comprised of the activation probabilities due to the 1-hop neighbors of user $u$ . Thus, similar to the IC model, the probability of user $u$ being activated via influence of an user $v$ is written as:

p_{uv}=w_{uv}p_{f}(u),

(1)

where the weights $w_{uv}$ $(0\leq w_{uv}\leq 1)$ can be determined from the user interactions in a network. The described probabilistic formulation has similarities to the Friedkin-Johnsen social influence model for opinion change [13], where the authors recognize that the dynamics of opinion change are governed by two mechanisms: intrinsic opinion and influenced opinion. Furthermore, by assuming that the nodes are not lazy and are activated by either of the two mechanisms that we outline, we set $P_{f}(u)=(1-P_{s}(u))$ . This renders the overall IC probability between nodes $v$ and $u$ to be:

p_{uv}=w_{uv}(1-p_{s}(u)).

(2)

Note that all the parameters discussed can be efficiently determined either by a maximum-likelihood-based approach or expectation-maximization (EM) approach as followed in [26]. For example, in the Twitter network, the proportion of tweets by a user $i$ that are intrinsic in nature can quantify $P_{s}(u)$ , while a particular weight $w_{uv}$ can be determined by the proportion of user $u’s$ retweets (or influenced activity) having their origin in the activity of user $v$ that user $u$ follows.

IV Prediction of Candidate Nodes

A fundamental aspect of our framework is the fact that only a certain fraction of nodes are likely to contribute to the AIM solution set. These nodes are referred to as “candidate nodes” in this work. Hence, it is desirable that instead of focusing on all nodes, one should attempt all computationally expensive predictions for the candidate nodes. This section describes a novel GNN based node classifier that leverages GNN based classifier to identify candidate nodes for AIM. This classifier precedes our primary DRL algorithm (GraMeR) as shown in Fig. 1, and thereby eases the computational burden.

The task of identifying candidate nodes is framed as a binary node classification problem where the two classes denote “candidate” and “non-candidate” nodes, respectively. The ground truth labels can either be generated via standard greedy hill climbing algorithm or novel centrality metrics recently being proposed [27, 18]. This work uses Influence capacity metric as it is computationally easy to compute compared to other approaches. Influence capacity (IFC) is a novel centrality metric to identify influential nodes. The influence of any node depends on its neighbor connection (local influence) and its own location in the graph (global influence). Local influence of a node $u$ ( $I_{L}(u)$ ) can be estimated as [27]:

I_{L}(u)=1+\sum_{v\in N(u)}P(u,v)+\sum_{v\in N(u)}\sum_{z\in N(v)}P(u,v)P(v,z),

(3)

where, $P$ is the influence probability associated with links, and the operator $N(.)$ denotes the neighbors. Likewise, the global influence score ( $I_{G}(u)$ ) can be expressed as:

I_{G}(u)=k_{c}(u)\big{(}1+\frac{D(u)}{D_{N}}\big{)},

(4)

where, $k_{c}(u)$ and $D(u)$ represent coreness score and degree of node $u$ , respectively. Notation $D_{N}$ denotes maximum node degree in the graph. The overall influence capacity $I(u)$ of a node $u$ can then be written as:

I(u)=\frac{I_{L}(u)}{\text{max}_{v\in V}I_{L}(v)}\times\frac{I_{G}(u)}{\text{max}_{v\in V}I_{G}(v)}.

(5)

Nodes retaining extreme values of IFC are labeled as “candidate nodes” while the rest of the nodes are marked as “noncandidate”. The threshold of approx $20\%$ seems to work in our case and it is determined based on several experiments, where the candidacy of nodes is validated with the standard Greedy hill-climbing algorithm. The ground truths from IFC are then leveraged to train the node classifier without relying on a computationally exhaustive Greedy hill-climbing approach [16]. It is important to note that there are various other algorithms to generate groundtruth and one can take any such approach for training the node classifier.

The GNN based node classification model consists of two hidden layers with GraphSAGE as the message-passing algorithm. We selected GraphSAGE since the computation graph for any node $u$ only depends on the induced subgraph up to $k$ -hop neighbors of $u$ . This allows training/prediction across different graph sizes which is most desirable for GraMeR. The parameters of GNN are trained by minimizing the categorical cross-entropy loss function. The overall candidate node prediction task is summarized in Algorithm A.1 in the appendix. It is interesting to note that if this GNN classifier can provide a confidence interval around its class predictions as shown in [28], then the succeeding DRL engine can utilize that interval while searching for optimal seed nodes. This further strengthens the overall framework and is kept as future work.

V Meta Reinforcement Learner

Graph Meta Reinforcement Learner (GraMeR) is a deep reinforcement learning (DRL) based framework that learns to identify optimal seed set for AIM. There are various novel aspects of our framework in terms of: (1) Meta learning to enable prediction across different graph types and sizes; (2) double Q learning to estimate sequence of seed nodes without solving a computationally intensive optimization problem at every test time and (3) single policy multi-objective reward formulation for systematic balancing of multiple AIM objectives. This section describes these novel aspects of GraMeR, and architecture is shown in Fig. 1.

Refer to caption — Figure 1: Basic architecture of GraMeR. Graphs of different types and sizes are provided to Candidate node prediction module for identifying candidate nodes (C: Candidate; NC: Non-candidate). The Influencer set identification module uses Meta DRL algorithm to train on the candidate nodes of a graph.

V-A Finite MDP

The task of identifying an optimal set of AIM seed nodes is a sequential process where nodes are added one at a time. More importantly, the factors determining the selection of node at any step of the process solely depends on the last node added to the sequence (solution set). Thus, this process follows the Markov property and is therefore formulated as a Markov decision process (MDP). Further, at any step of the process, the action space is finite, i.e., a node has to be selected from a finite set of nodes. So, more precisely, the process can be termed as a finite MDP. The key ingredients to define any finite MDP in the context of DRL are:
State represents the current solution set where nodes are appended in sequence to form the final AIM solution set. Thus, cardinality of the state keeps increasing with the process. Therefore, a state representing vector ( $\mathbf{X^{t}}$ ) of fixed dimension is needed. $\mathbf{X^{t}}$ should characterize the state of the system at any time step $t$ in terms of nodes being selected. Therefore, $\mathbf{X^{t}}$ can be expressed as:

\mathbf{X^{t}}=f(S^{t})

(6)

where, $S^{t}$ is the partial AIM solution set at time $t$ and $f$ is the transformation operator. As nodes in the state are sampled from a graph, an appropriate choice for $f$ operator would be a Graph neural network based transformation which will be discussed in detail in the forthcoming sub-section.
Action refers to the process of adding a new node $u$ to a partial solution set $S^{t}$ .
Reward quantifies the benefit of taking an action. There are two objectives in our AIM formulation which leads to two reward functions. The first reward ( $R_{1}$ ) belongs to the marginal gain in influence spread when a particular node has been added to the solution set. The second reward ( $R_{2}$ ) corresponds to the intrinsic probability of node being added to the solution set. The rewards can be written as:

\begin{split}R_{1}(X,a)&=\mathcal{I}(G,S\cup\{a\})-\mathcal{I}(G,S)\\ R_{2}(X,a)&=p_{s}(a)\end{split}

(7)

where, the operator $\mathcal{I}$ computes the influence spread of solution set $S$ in graph $G$ under the independent cascade model.
Environment: It’s an agent world with which it interacts and comprises of everything outside the agent. Here, the environment is a graph. These interactions occur continually, i.e., agent selects node and the graph environment responds to those actions and present new situations to the agent.
Policy: The policy is a strategy or suggested actions that the DRL agent should take at every state of the environment so as to pursue the goal of the learning. It is a probability distribution over feasible nodes that could be added to partial solution set $S^{t}$ to move the state from $X^{t}$ to $X^{t+1}$ . Hence, policy $\pi(a|X^{t})$ selects the node that yields highest cumulative reward at any arbitrary state $X^{t}$ .
Termination: At every episode, the search starts with a random node from the candidate set, and the estimated nodes are appended to the partial solution set, one at a time ( one at each step of the episode). The episode is terminated when the cardinality of the solution set $S^{t}$ attains the search budget $b$ .

V-B Algorithm

GraMeR consists of three modules as illustrated in Fig. 1. The first module acts as an environment pool containing training graphs from different families and sizes. The second module provides a set of candidate nodes (described in Section 4) on which the AIM search algorithm will be implemented. Finally, the third module deals with the DRL agent. The process of training the agent starts by randomly selecting a graph $G_{i}$ from the pool and passing through second module to generate a candidate node set. Each training graph serves as an environment with which our agent interacts via MDP. An episode starts with a random node from the candidate set of the sampled graph and it continues until the budget is consumed. At each step of the episode, the agent will select the next node based on its current policy. Thereafter, the agent updates its current policy by training a Q network with a sampled batch of data from the buffer. Replay buffer stores the state, action and rewards from all the past steps of the process across episodes and environments (graphs), allowing Q network to exploit known information. Once the episode meets termination criterion, the agent samples a new graph and the training iterates until the policy converges.

Vanilla Q Learning: The agent in GraMeR trains via double Q-learning as it is a discrete finite MDP. The vanilla Q-learning maximizes a cumulative reward of actions taken during the interactions of agent with environment [29]. Reward at future times depends on actions taken at current time. The optimal value of an action (i.e., Q-value) corresponds to the optimal policy that maximizes the Q-value. Therefore, Q-value is iteratively updated according to Bellman equation as,

\begin{split}Q(X^{t},a^{t})=Q(X^{t},a^{t})+\\ \theta*[r^{t}+\gamma\max_{a^{\prime}}Q(X^{t+1},a^{\prime})-Q(X^{t},a^{\prime})]\end{split}

(8)

Q learning using Eq. (8) usually suffers from overestimation in practice due to the use of single estimator (Q-network) that determines the best action at next state with highest Q-value as well as the Q-value of that best action [30]. To avoid overestimation, double Q-leaning is proposed in [31], that uses two different estimators. One estimator (Local network $Q^{L}$ ) determines the best possible action for the next state and the other (target network $Q^{T}$ ) provides the Q-value of the selected action. The modified update equation of $Q^{L}(X^{t},a^{t})$ is,

\begin{split}Q^{L}(X^{t},a^{t})+\theta*[r^{t}+\gamma Q^{T}(X^{t+1},a^{*})-Q^{L}(X^{t},a^{t})],\\ a^{*}=\operatorname*{arg\,max}_{a^{\prime}}Q^{L}(X^{t+1},a^{\prime}),\end{split}

(9)

and $\theta$ is the tuning parameter. Local network ( $Q^{L}$ ) is trained at every step of the episode by sampling a batch of data from replay buffer. Mean squared error loss between the predicted Q value (i.e., $Q^{L}(X^{t},a^{t})$ ) and the desired Q value from the bellman equation ( $r^{t}+\gamma Q^{T}(X^{t+1},a^{*})$ ) is minimized to update the parameters of the local $Q^{L}$ network. While, the target network $Q^{T}$ is not explicitly trained at every step, it is continuously updated with the weights of entire $Q^{L}$ network after a certain number of episodes.

Meta Q Learning: A meta-learning attribute is introduced in the GraMeR to solve unseen tasks fast and efficiently. Here, the agent is expected to generalize to new graph types that have never been encountered during training. Typically, the Meta reinforcement learning approach contains two optimizer loops [32]. The outer optimizer samples a new environment in every iteration and adjusts parameters that determine agent behavior. In the inner loop, the agent interacts with the environment and optimizes for maximum reward. As in most of the environments (such as mazes, self driving car, etc.), it is not feasible to obtain a representing vector for entire environment. Therefore, learning across environments is captured via an outer optimizer. However, in the case of AIM, the environment is a graph, and it can be represented very accurately by a single graph embedding vector [33]. Hence, we skip the outer optimizer and feed the entire environment (graph) information ( graph embedding vector) along with state and actions to the training algorithm. This enables Q learning to capture the variation of environment via a single optimizer. Then the exact update equations of our agent turns out to be,

\begin{split}a^{*}=\operatorname*{arg\,max}_{a^{{}^{\prime}}}Q^{L}(X^{t+1},a^{\prime},G_{i})\\ Q^{L}(X^{t},a^{t},G_{i})=Q^{L}(X^{t},a^{t},G_{i})+\\ \theta*[r^{t}+\gamma Q^{T}(X^{t+1},a^{*},G_{i})-Q^{L}(X^{t},a^{t},G_{i})]\end{split}

(10)

where, $G_{i}$ corresponds to the sampled graph for the $i^{th}$ episode. It is worth noting that graphs changes with episodes but for a particular episode, a single graph is explored.

GNN encoding: Similar to the graph, we also need representing vectors for the state (partial solution set ) as well as action (node). In this regard, we leverage GraphSAGE [34] to estimate node embeddings. Thereafter, the embedding vectors of nodes in the state $\mathbf{S^{t}}$ are aggregated via mean/max operation to obtain a single representing vector $\mathbf{X^{t}}$ for the entire state. As action corresponds to a single node, it is represented by the corresponding node embedding vector. GraphSAGE is selected over conventional GCN [8] due to the factor that the candidacy of a particular node to be a part of AIM solution set depends mostly on its sub-graph. Therefore, GraphSAGE being an indutive sub-graph based learning approach is an appropriate choice. Further, GraphSAGE does not demand entire graph information unlike GCN, which requires the entire adjacency matrix and thus does not scale well with the size of the graph.

Multi-objective shaping: Along with states and action, the reward function needs to be explicitly designed for AIM as it comprises of multiple objectives. The first objective is related to maximizing influence spread and the other goal corresponds to maximizing intrinsic probability of seed nodes. Therefore, GraMeR belongs to the category of multi-objective DRL [35]. We take a single policy approach to learn one optimal policy by combining the two objectives with a known preference weight. Precisely, at each step of the episode, rewards (Eq. (7)) are computed individually for each objective and accumulated in the buffer along with state and actions. Then, Q values of two objectives are combined using linear scalarization technique to generate a single Q value that is used to select an action. This is different than a typical approach of combining rewards into a single value and consequently learning single Q value and it is shown to be a stable and efficient [36]. Algorithms 1 and 2 summarizes the entire GraMeR involving GNN based state/environment representation, meta learning across different graph types and multi-objective rewards shaping.

Algorithm 1 Pseudocode of GraMeR

0: set of training graphs

G

, input node features

X

, set of candidate nodes

C

0: Trained GraMeR agent.

Initialize: State

X\leftarrow\text{Max}

\{h_{u}\forall u\in S\}

2: Initialize: Graph embedding to

G_{i}

\leftarrow

\text{Max}\{h_{u}\forall u\in V\}

Initialize: Action

a

corresponds to selection of node

u

\leftarrow

h_{u}

where

h_{u}

denotes node embedding from GraphSAGE.

4: Initialize: Some arbitrary values to

Q_{1}(X,a)

Q_{2}(X,a)

for all state-action pairs.

for Loop for each episode do

6: Initialize:

X^{0}

to initial state (random node from

C

)

for Loop for each step of episode until state is terminal do

8: Select node

a^{*}

from scalarized action selection strategy

Append node

a

to partial solution set

S

10: Obtain rewards

r1,r2

from both objectives.

Move to next step

X^{{}^{\prime}}

12: Update

Q^{L}

of both objectives.

Q^{L}_{1}(X,a,G_{i})=Q^{L}_{1}(X,a,G_{i})+\theta*[r+\gamma Q^{T}_{1}(X,a^{*},G_{i})-Q^{L}_{1}(X,a,G_{i})]

14:

Q^{L}_{2}(X,a,G_{i})=Q^{L}_{2}(X,a,G_{i})+\theta*[r+\gamma Q^{T}_{2}(X,a^{*},G_{i})-Q^{L}_{2}(X,a,G_{i})]

end for

16: end for

update

Q^{L}_{1}

and

Q^{L}_{2}

parameters by minimizing cross entropy loss between target values from bellman equation and actual prediction from networks.

18: update

Q^{T}

network by copying

Q^{L}

parameters to

Q^{T}

at every

U

steps.

return Trained models

Q^{L}_{1}

and

Q^{L}_{2}

Algorithm 2 Scalarized action selection

0: Q values of both objectives for current state-action pair

(X,a)

0: action

a^{*}

Qnew =

W_{1}*Q_{1}(X,a)+W_{1}*Q_{2}(X,a)

Greedy action selection

a^{*}=\left\{\begin{array}[]{ll}\operatorname*{arg\,max}A_{a^{{}^{\prime}}}&\text{probability}\hskip 2.84544pt1-\epsilon\\ \text{random selection from set C}&\text{probability}\hskip 2.84544pt\epsilon\end{array}\right.

3: return action

a^{*}

VI Experimental Results

This section validates the proposed framework against modified greedy hill-climbing algorithm (MGHC) and S2VDQN. GraMeR offers improved performance with more flexibility while being orders of magnitude faster. The performance is examined on standard networks, namely Barabasi Albert, Power law cluster, stochastic block models. As per the architecture, $1^{st}$ phase of the local Q-network ( $Q^{L}$ ) comprises of $3$ GNN layers for generating node embeddings. The number of neurons in these layers are $64$ , $32$ and $16$ , respectively. This phase generates state, action and graph embedding vectors which are concatenated into a vector and passed through a regression phase of the $Q^{L}$ network. Regression phase consists of $2$ feedforward layers with $16$ and $1$ neurons respectively. All other settings have been discussed in the appendix.

VI-A Baselines

We have first validated the performance of the proposed candidate node prediction model with Greedy Hill-climbing (GHC) algorithm that selects seed nodes based on the marginal gain in influence spread. Then, GraMeR (core module) is compared with modified greedy Hill-climbing (MGHC) algorithm [2] that sample nodes based on intrinsic probability and select seed nodes which provide high gain in spread. Although this algorithm strives for two objectives (i.e, high intrinsic probability and high marginal gain in influence spread) of AIM, there is no inbuilt mechanism to control their priorities. In fact, none of the recent data-driven work addresses this type of multi-objective formulation of IM [5, 9]. In contrast, we have systematically incorporated the multi-objective formulation with a controlled weighing factor $\alpha$ . Furthermore, the baseline also includes modified S2V-DQN (MS2V-DQN) [5] that combines RL with graph representation. The baselines implementation is further discussed in the Appendix.

VI-B Results

This section demonstrates the performance of proposed framework against baselines via performance metrics and execution times. The evaluation is repeated $100$ times and average scores are reported for each test scenario. The source code can be found in the following anonymized github link https://anonymous.4open.science/r/InfluenceMaximization-Deep-QLearning-B20B.

VI-B1 Performance of Candidate node predictor

The candidate node prediction module is trained on $6$ graphs of varying dimension ( $200$ to $400$ nodes) and type (BA, PLC, BP). The ground truth value of node classes is obtained using influence capacity metric as discussed in Eq.(5). The model is evaluated on $3$ test graphs of $1000$ nodes, one from each graph family. The mean classification accuracy in detecting candidate and non-candidate nodes is around $96.24$ % with a recall score (class-1 accuracy) of $97.95$ %. Further, it has been shown via Fig. 2 that there is a noticeable reduction of $40$ % in algorithm running time compared to the greedy hill-climbing approach. This is due to a significant reduction in search space which can immensely speedup the training of GraMeR agent in later part of the learning pipeline.

VI-B2 Accuracy of GraMeR

The performance of GraMeR for activation aware influence maximization can be evaluated via two metrics namely influence spread and node intrinsic probability. Expected influence spread provides the mean number of nodes getting influenced in the network if seed nodes in the solution set are activated with corresponding intrinsic probabilities and information is spread via IC diffusion model. Influence spread is normalized between $0$ to $1$ for comparison across different graph sizes. The second metric (i.e., intrinsic probability) represents the probability of seeds nodes being activated on their own. The mean probabilities for all the seed nodes in the estimated solution set are reported.

Fig. 3 compares expected influence spread and mean intrinsic probability across different graph types and budget. It can be observed that the proposed GraMeR is at par with the baseline MGHC with much lower computational effort as demonstrated in the next subsection. Specifically, GraMeR consistently outperforms MGHC and MS2V-DQN in terms of spread whereas the reverse phenomenon can be seen in case of intrinsic probability. This is because the MGHC and MS2V-DQN does not have a mechanism to balance between influence spread and intrinsic probability. Therefore, they always provides the seed nodes having high intrinsic probability but sub-optimal in terms of influence spread. The scale of the plots is kept same for a fair comparison across different graph types. Further, it can be seen that for a similar graph size ( $600$ nodes) and diffusion model, the influence spread is maximum in case of SBM graph and minimum in BA. This is due to the fact that SBM have community structures where subsets of nodes are connected with each other through a large link densities. AIM led to the selection of seed nodes from different clusters which results in a high influence spread compared to BA graph that misses such clusters.

VI-B3 Computational gain of GraMeR

One of the key advantages of the proposed approach is the computational efficiency which is demonstrated via algorithm running times i.e. wall clock time. Fig. 4 depicts the running times across different budgets and graphs. It can be inferred that as the budget increases, the search time increases which is very intuitive. However, the increase in time is very sharp (linear for the experimented budgets) in the case of baseline MGHC, while it is almost constant for the proposed GraMeR. This is because the trained deep Q networks in GraMeR computes the solution set via forward propogation which mainly involves matrix operations. On the other hand, the MGHC searches for increasing number of nodes as the budget increases which eventually demands the computation of influence spread for large number of sets. Fig. 4 also illustrates that search algorithms for AIM are fastest in BA and slowest in SBM. This observation could be attributed to the high clustering coefficient in SBM leading to a large time cost associated with shifting from one cluster to an other while searching for an optimum seed set.

VI-B4 Generalizability of GraMeR

The core theme of our framework reinforces the property of generalizability as there are various types of real-world and synthetic networks that exist in the literature. Many of these graph types have only slight variation in their topological property. Thus, a separate DRL model to identify AIM seed nodes for each of these graph types is not needed rather, a meta learning can serve the purpose by learning across different environments (graph types). This fact is demonstrated by training on two graph types (i.e., PLC and BA) while validating on all the three graph families. The performance in terms of influence spread, intrinsic probability and running time is consistent across all three graphs as shown in Figs. 3 to 6.

VI-B5 Scalability of GraMeR

Scalability of GraMeR is summarized in Fig. 5, which presents the compute time for different graph sizes. Here, the model is trained on $8$ graphs ( $4$ from PLC and BA) each having $400$ nodes. Then, it is tested on graphs of sizes varying from $200$ to $20000$ . It can be inferred that GraMeR outperforms the baseline without any significant impact on computational effort. Specifically, running time for MGHC scales by $1500$ for $10x$ increase in graph size whereas, it remains almost constant for GraMeR. Further, to reinforce the scalability benefit, the training of GraMeR is carried for a fixed budget of $10$ but prediction is carried out for multiple budgets from $5$ to $20$ . The scalability can further be enhanced through distributed computing as shown by [37]. This work will address scaling to larger graphs on a wide variety of platforms (GPUs), which we plan to pursue in our future work.

VI-B6 Ablation study

The proposed GraMeR has computational supremacy over conventional methods because of two factors: (1) Deep Q networks based meta reinforcement learning approach to identify AIM solution set; and (2) candidate node predictor that reduces the search space. Fig. 6 depicts the ablation study results where running time is monitored for GraMeR and MGHC with and without the candidate node prediction module. The budget is fixed as $10$ and prediction is done for the PLC graph across different sizes. It can be seen that the time gap between the two cases increases with the graph size with nearly $1$ minute for GraMeR and $110$ minute for MGHC. This gap will further increase as the network size grows. Further, apart from prediction, noticeable gap is also seen in training time. This demonstrates the importance of the node prediction module in GraMeR.

VII Conclusions

We presented a GNN fused meta reinforcement learning framework (GraMeR) for identifying influential nodes in a network. Firstly, the search space of IM is reduced via GNN based candidate node predictor. Then deep Q learning is employed to learn to identify IM seed nodes with GNN as environment encoders. The unique aspects of GraMer lies in its computational efficiency and generalizability. Future directions of research include extension of GraMeR for uncertain networks suited for financial/manpower constraint applications.

References

[1] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 137–146.
[2] A. V. Sathanur, M. Halappanavar, Y. Shi, and Y. Sagduyu, “Exploring the role of intrinsic nodal activation on the spread of influence in complex networks,” in Social Network Based Big Data Analysis and Applications. Springer, 2018, pp. 123–142.
[3] M. Farajtabar, N. Du, M. G. Rodriguez, I. Valera, H. Zha, and L. Song, “Shaping social activity by incentivizing users,” Advances in neural information processing systems, vol. 27, 2014.
[4] T.-T. Quach and J. D. Wendt, “A diffusion model for maximizing influence spread in large networks,” in International Conference on Social Informatics. Springer, 2016, pp. 110–124.
[5] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song, “Learning combinatorial optimization algorithms over graphs,” arXiv preprint arXiv:1704.01665, 2017.
[6] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization with graph convolutional networks and guided tree search,” arXiv preprint arXiv:1810.10659, 2018.
[7] Y. Hu, Y. Yao, and W. S. Lee, “A reinforcement learning approach for optimizing multiple traveling salesman problems over graphs,” Knowledge-Based Systems, vol. 204, p. 106244, 2020.
[8] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[9] S. Manchanda, A. Mittal, A. Dhawan, S. Medya, S. Ranu, and A. Singh, “Gcomb: Learning budget-constrained combinatorial algorithms over billion-sized graphs,” 2020.
[10] H. Chen, W. Qiu, H.-C. Ou, B. An, and M. Tambe, “Contingency-aware influence maximization: A reinforcement learning approach,” arXiv preprint arXiv:2106.07039, 2021.
[11] P. Domingos and M. Richardson, “Mining the network value of customers,” in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’01. New York, NY, USA: Association for Computing Machinery, 2001, p. 57–66.
[12] A. Gionis, E. Terzi, and P. Tsaparas, “Opinion maximization in social networks,” in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 387–395.
[13] N. Friendkin and E. C. Johnsen, “Social influence networks and opinion change,” Adv Group Proc, vol. 16, pp. 1–29, 1999.
[14] A. Srivastava, R. Petering, N. Barr, R. Kannan, E. Rice, and V. K. Prasanna, “Network-based intervention strategies to reduce violence among homeless,” Social Network Analysis and Mining, vol. 9, no. 1, pp. 1–12, 2019.
[15] E. Rice, L. Onasch-Vera, G. Diguiseppi, C. Hill, R. Petering, N. Wilson, D. Woo, N. Thompson, M. Tambe, B. Wilder et al., “Using artificial intelligence to augment network-based, hiv prevention for youth experiencing homelessness,” in APHA’s 2020 VIRTUAL Annual Meeting and Expo (Oct. 24-28). American Public Health Association, 2020.
[16] M. Minutoli, M. Halappanavar, A. Kalyanaraman, A. Sathanur, R. Mcclure, and J. McDermott, “Fast and scalable implementations of influence maximization algorithms,” in 2019 IEEE International Conference on Cluster Computing (CLUSTER), 2019, pp. 1–12.
[17] S. Munikoti, L. Das, and B. Natarajan, “Scalable graph neural network-based framework for identifying critical nodes and links in complex networks,” Neurocomputing, vol. 468, pp. 211–221, 2022.
[18] O. A. Hussain and F. Zaidi, “Influence maximization in complex networks through supervised machine learning,” in International Conference on Complex Networks and Their Applications. Springer, 2021, pp. 217–228.
[19] S. Munikoti, L. Das, and B. Natarajan, “Bayesian graph neural network for fast identification of critical nodes in uncertain complex networks,” in 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2021, pp. 3245–3251.
[20] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” arXiv preprint arXiv:1506.03134, 2015.
[21] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[22] C. Ravazzi, F. Dabbene, C. Lagoa, and A. V. Proskurnikov, “Learning hidden influences in large-scale dynamical social networks: A data-driven sparsity-based approach, in memory of roberto tempo,” IEEE Control Systems Magazine, vol. 41, no. 5, pp. 61–103, 2021.
[23] S. Munikoti, K. Lai, and B. Natarajan, “Robustness assessment of hetero-functional graph theory based model of interdependent urban utility networks,” Reliability Engineering & System Safety, vol. 212, p. 107627, 2021.
[24] W. Kool, H. Van Hoof, and M. Welling, “Attention, learn to solve routing problems!” arXiv preprint arXiv:1803.08475, 2018.
[25] Q. Wang, Y. Ji, Y. Hao, and J. Cao, “Grl: Knowledge graph completion with gan-based reinforcement learning,” Knowledge-Based Systems, vol. 209, p. 106421, 2020.
[26] K. Saito, R. Nakano, and M. Kimura, “Prediction of information diffusion probabilities for independent cascade model,” in International conference on knowledge-based and intelligent information and engineering systems. Springer, 2008, pp. 67–75.
[27] S. S. Singh, A. Kumar, S. Mishra, K. Singh, and B. Biswas, “A centrality measure for influence maximization across multiple social networks,” in International Conference on Advanced Informatics for Computing Research. Springer, 2019, pp. 195–207.
[28] S. Munikoti, D. Agarwal, L. Das, and B. Natarajan, “A general framework for quantifying aleatoric and epistemic uncertainty in graph neural networks,” arXiv preprint arXiv:2205.09968, 2022.
[29] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[30] J. E. Smith and R. L. Winkler, “The optimizer’s curse: Skepticism and postdecision surprise in decision analysis,” Management Science, vol. 52, no. 3, pp. 311–322, 2006.
[31] H. Hasselt, “Double q-learning,” Advances in neural information processing systems, vol. 23, pp. 2613–2621, 2010.
[32] M. Botvinick, S. Ritter, J. X. Wang, Z. Kurth-Nelson, C. Blundell, and D. Hassabis, “Reinforcement learning, fast and slow,” Trends in cognitive sciences, vol. 23, no. 5, pp. 408–422, 2019.
[33] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[34] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[35] T. T. Nguyen, N. D. Nguyen, P. Vamplew, S. Nahavandi, R. Dazeley, and C. P. Lim, “A multi-objective deep reinforcement learning framework,” Engineering Applications of Artificial Intelligence, vol. 96, p. 103915, 2020.
[36] K. Van Moffaert, M. M. Drugan, and A. Nowé, “Scalarized multi-objective reinforcement learning: Novel design techniques,” in 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 2013, pp. 191–199.
[37] M. Minutoli, M. Halappanavar, A. Kalyanaraman, A. Sathanur, R. Mcclure, and J. McDermott, “Fast and scalable implementations of influence maximization algorithms,” in 2019 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2019, pp. 1–12.
[38] P. Holme and B. J. Kim, “Growing scale-free networks with tunable clustering,” Physical review E, vol. 65, no. 2, p. 026107, 2002.

Appendix A Appendix

A-A Datasets

The performance of GraMeR is examined on standard networks. The description of datasets are as follows:

1.

Barabasi Albert (BA): Graphs which are grown by attaching new nodes each with fixed number of edges that are preferentially attached to existing nodes with high degree. Several natural and real-world networks, including the Internet, citation networks, and some social networks are learned to follow this model.
2.

Power law cluster (PLC): Graphs which exhibit both power law degree distribution and clusters, and many real-world networks manifest these properties. As shown in [38], one can construct a PLC graph by following a process of preferential attachment but in some fraction of cases (p), a new node $N_{y}$ connects to a random selection of the neighbors of the node to which $N_{y}$ last connected.
3.

Stochastic-Block model (SBM): A Graph generated with k clusters, and a $k\times k$ (symmetric) matrix $P$ of probabilities, where $P_{i,j}$ is the probability that a pair of nodes $(a,b)$ will be joined by a link if $a$ is in cluster $i$ and $b$ is in cluster $j$ .

A-B Algorithm of candidate node prediction

Algorithm 3 Training of Candidate node prediction module

0: set of training graphs, input node features

X_{u}

, node labels

Y_{u}(\text{candidate or non candidate})\hskip 2.84544pt\forall\hskip 2.84544ptu\epsilon V

0: Trained GNN

1: Initialize:

h_{v}^{0}=X_{v}

\forall\hskip 2.84544ptv\epsilon V

2: for layer

l=1

3

3: for node

u=1

u=V

h_{N(u)}^{l}

= Attention(

h_{k}^{l-1}

)

\forall\hskip 2.84544ptk\epsilon N(u)

;

h_{u}^{l}=

Relu(

[h_{u}^{l-1}||h_{N(u)}^{l}]

)

\forall\hskip 2.84544ptu\epsilon V

;

6: end for

7: end for

8: z = mean

(h_{u}^{l})

\forall\hskip 2.84544ptu\epsilon V

\hat{Y}_{u}

= Linear(

h_{u}^{l}||z

)

10: update GNN model by minimizing cross entropy loss between

\hat{Y}_{u}

and

Y_{u}

11: return Trained GNN model

A-C Experimental setup

All training and evaluation experiments are performed on a system with Intel i9-4820K processor running at 3.70GHz with 8 cores, having 1 Nvidia RTX 2080 Ti GPU with 12 GB memory, and 64 GB RAM. Evaluation is repeated $10$ times and average of metrics are reported.

A-C1 Evaluation settings

We first created a training pool by randomly generating $15$ graphs from each graph family (Barabasi-Albert, Power-law cluster, Stochastic block model) with the NetworkX library. Ten graphs of each category are kept for training, and five for testing. The parameters of GNN based candidate node predictor classifier are: Depth (# of node embedding module): 2; # of neurons in $2$ layers: 64,32; # of MLP layers: 3; # neurons in MLP layer: 12,8,1; Aggregator function: Mean; Activation function: ReLU (except last layer with sigmoid). The training is carried out via Adam optimizer with a learning rate of $0.0001$ . The output of this module is a masked graph with candidate and non-candidate nodes.

The training of GraMeR is carried out on the masked graphs via double Q learning. Since states and action are represented via node embeddings, parameters of the $1^{st}$ phase of the GNN model are: Depth (# of node embedding module): 3; # of neurons in $3$ layers: 64, 32, 16. Aggregator function: Max. The parameters of $2^{nd}$ phase (i.e., regression) are: # of MLP layers: 3; # neurons in MLP layer: 12,8,1; Activation function: ReLU (except last layer with linear). Moreover, the settings specific to DRL are: Replay buffer size: 10000; batch size: 64; gamma: $0.99$ ; Learning rate: $0.0008$ ; target network ( $Q^{T}$ ) update frequency: $10$ ; exploration decay rate: $0.996$ . Mean squared error is used as a loss function to update $Q$ network weights with Adam optimizer. Further, validation is conducted after each step of training, where the mean of reward functions on validation graphs is monitored. The model is trained until rewards are saturated, which takes around $1000$ episodes in our case. Thereafter, the best model is evaluated on $15$ test graphs, $5$ for each graph family. All training and validation is carried out in PyTorch with the support of Deep Graph Library.

A-C2 Baseline details

As a first baseline for AIM seed nodes, we have implemented a modified greed hill-climbing algorithm (MGHC) proposed in [2]. Let $S^{t}$ be the partial solution set of influential nodes at step $t$ . The classical greedy hill-climbing optimizer expands the set to size ( $t+1$ ) by polling each of the nodes not in $S^{t}$ and augmenting those nodes, one at a time to form the set $S^{t}\cup\{u\}$ and looking for the best marginal gain in terms of activations (no. of nodes influenced in the entire graph). At each such step $t$ , instead of activating every single node in $S^{t}\cup\{u\}$ and then computing activations according to the desired diffusion model, each node in the set $S^{t}\cup\{u\}$ is activated probabilistically with the corresponding node intrinsic probability thereby simulate the intrinsic activation process. The modified part is depicted in line $9$ of Algorithm 1 in App A. Given the probabilistic nature of the algorithm, the overall activation numbers are obtained by running the diffusion model in a Monte Carlo fashion that invokes $M$ independent trials of randomized graphs.