Learnability of Competitive Threshold Models
Abstract
Modeling the spread of social contagions is central to various applications in social computing. In this paper, we study the learnability of the competitive threshold model from a theoretical perspective. We demonstrate how competitive threshold models can be seamlessly simulated by artificial neural networks with finite VC dimension, which enables analytical sample complexity and generalization bounds. Based on the proposed hypothesis space, we design efficient algorithms under the empirical risk minimization scheme. The theoretical insights are finally translated into practical and explainable modeling methods, whose effectiveness is verified through a sanity check over a few synthetic and real datasets. The experimental results show that our method achieves decent performance without requiring excessive data points, outperforming off-the-shelf methods.
1 Introduction
Social contagion phenomena, such as the propagation of news and behavior patterns or the adoption of opinions and new technologies, have attracted substantial research interest from various fields such as sociology, psychology, epidemiology, and network science cozzo2013contact; Goldstone2005ComputationalMO; camacho2020four. Formal methods for modeling social contagion are crucial to many applications, for example, recommendation and advertising in viral marketing qiu2018deepinf and misinformation detection tong2020stratlearner. Given the operational diffusion models, a fundamental problem is to what extent we can learn such models from data, i.e., the learnability of the models. In this paper, we study the classic threshold models, in which the core hypothesis is that the total influence from active friends determines the contagion according to a threshold. We seek to derive their PAC learnability and sample complexity, by which efficient learning algorithms become possible.
Motivation.
Learning contagion models from data has been a central topic in social computing. Some existing works focus on estimating the parameters of diffusion models by local estimation techniques in order to predict the influence results goyal2010learning; du2012learning; de2014learning. Most of these works assume fully observed cascade data. In real life, data sets with local diffusion information are not always available, and therefore we also consider data sets with incomplete observations; moreover, our learnability results make no assumptions on the data type. Another line of work develops various graph neural network models to estimate the influence of each node using deep learning techniques li2017deepcas; qiu2018deepinf; leung2019personalized. Although these methods have shown significant improvements in influence estimation performance, they do not come with theoretical guarantees. In our work, we aim to establish the analytical sample complexity of influence learning.
Recently, some works have focused on learning the influence function of different diffusion models from training data, i.e., the mapping from the seed set to the diffusion results. The works of du2012learning; he2016learning learn an approximate influence function by parameterizing a group of random reachability functions. Our approach instead learns the influence function directly from its hypothesis space without any such approximation. The most related works propose proper PAC learning algorithms for the influence function narasimhan2015learnability; adiga2019pac, but they assume that only one cascade diffuses in the network. However, it is common that multiple competitive cascades disseminate simultaneously, for example, in network marketing campaigns qiu2018deepinf. From this standpoint, we take a step towards the influence estimation problem without limiting the number of cascades.
Contribution.
Our contribution can be summarized as follows.
-
•
An influence function class under the competitive threshold model with finite VC dimension: we simulate the diffusion process as information propagation through a neural network whose units are all piecewise polynomial.
-
•
Proper PAC learning algorithms: we propose learning algorithms for two data types following the empirical risk minimization strategy. In particular, we develop a polynomial-time approach via linear programming under full observation.
-
•
Superior performance: experimental results on different datasets demonstrate better prediction performance for our proposed approaches.
2 Preliminaries
We first describe the diffusion model and then present the learning settings.
2.1 Competitive linear threshold model
We follow the standard competitive linear threshold (CLT) model ?, which has been widely adopted in the formal study of social contagions ?. A social network is represented as a directed graph , where is the set of nodes and is the set of edges, with and being the numbers of nodes and edges, respectively. For each node , we denote by the set of its in-neighbors.
We consider the scenario where there are information cascades, each of which starts to spread from its seed set for . Associated with each node and each cascade , there is a threshold ; associated with each edge and each cascade , there is a weight . Without loss of generality, we assume that the seed sets are disjoint and that the weights are normalized such that for each node and cascade . The nodes are initially inactive and become -active once activated by cascade . In particular, the diffusion process unfolds step by step as follows:
-
•
Step : For each cascade , nodes in become -active.
-
•
Step : Let be the set of -active nodes after step. There are two phases in step .
-
–
Phase 1: For an inactive node , let be the summation of the weights from its -active in-neighbors:
(1) The node is activated by cascade if and only if . After phase 1, it is possible that a node is activated by more than one cascade.
-
–
Phase 2: For each node , it will be finally activated by the cascade with the largest weight summation:
(2) For tie-breaking, cascades with smaller indices are preferred.
-
–
Clearly, there are at most diffusion steps, and without loss of generality, we may assume that there are always diffusion steps. An illustration is given in Figure 1. The following notations are useful for formal discussions.
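To make the two-phase dynamics above concrete, the following is a minimal Python sketch of one possible implementation. The data structures (dictionaries keyed by nodes and cascade indices) and the non-strict threshold comparison are illustrative assumptions rather than details fixed by the text.

```python
def clt_diffusion(neighbors_in, weights, thresholds, seeds, num_steps):
    """Minimal sketch of the two-phase CLT diffusion described above.

    neighbors_in[v]  : list of in-neighbors of node v
    weights[k][(u,v)]: weight of edge (u, v) for cascade k (normalized per node)
    thresholds[k][v] : threshold of node v for cascade k
    seeds[k]         : seed set of cascade k (seed sets are disjoint)
    Returns a dict mapping each activated node to its winning cascade index.
    """
    status = {v: k for k, seed_set in enumerate(seeds) for v in seed_set}
    for _ in range(num_steps):
        candidates = {}  # phase 1: inactive node -> {cascade: influence summation}
        for v in neighbors_in:
            if v in status:
                continue  # active nodes keep their cascade
            for k in range(len(seeds)):
                total = sum(weights[k][(u, v)] for u in neighbors_in[v]
                            if status.get(u) == k)
                if total >= thresholds[k][v]:   # assumed non-strict comparison
                    candidates.setdefault(v, {})[k] = total
        # phase 2: adopt the cascade with the largest summation;
        # ties are broken in favor of the smaller cascade index
        for v, sums in candidates.items():
            status[v] = min(sums, key=lambda k: (-sums[k], k))
    return status
```

Note that the candidate sets are collected before any status is updated, so each step uses only the statuses from the end of the previous step, matching the synchronous update described above.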
Definition 1.
We use a binary matrix to denote the initial status, where if and only if node is in the seed set of cascade . For a time step , the diffusion status is denoted by a binary tensor , where, for a cascade , if and only if node is -active after phase in time step .
We are interested in the learnability of the influence function of CLT models.
Definition 2.
Given a CLT model, the influence function maps from the initial status to the final status .
2.2 Learning settings
Assuming that the social graph is unknown to us, we seek to learn the influence function using a collection of pairs of initial statuses and the resulting statuses:
(3) |
where denotes the diffusion status after each phase in each diffusion step:
We use the zero-one loss to measure the difference between the prediction and the ground truth , which is defined as
For a distribution over the input and a CLT model generating the training data , we wish to leverage to learn a mapping such that the generalization loss can be minimized:
(4) |
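As a concrete reading of this objective, the sketch below computes one plausible instantiation of the zero-one loss (entry-wise disagreement averaged over the status tensor) and a Monte-Carlo estimate of the generalization loss on held-out samples; the exact aggregation used in the paper is not reproduced here and may differ.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # Average entry-wise disagreement between two binary status tensors of the
    # same shape; an exact-match indicator would be the stricter alternative.
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))

def empirical_generalization_loss(hypothesis, samples):
    # hypothesis: callable mapping an initial status X to a predicted status
    # samples   : iterable of (X, Y) pairs drawn from the input distribution
    return float(np.mean([zero_one_loss(hypothesis(X), Y) for X, Y in samples]))
```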
3 Learnability analysis
In this section, we present a PAC learnability analysis for the influence function. The overall idea is to design a realizable hypothesis space with a finite VC dimension, by which PAC learnability can be established. The following assumptions are made throughout the analysis.
Assumption 1.
We assume that the diffusion model is a decimal system of finite precision: for the involved parameters (i.e., the weights and thresholds), the number of decimal places is upper bounded by a constant .
Assumption 2.
The number of cascades is a power of two, i.e., for some constant .
3.1 Realizable hypothesis space
In order to obtain a realizable hypothesis space, we seek to explicitly simulate the diffusion function through neural networks. While universal approximation theory ensures that such neural networks exist, we wish for explicit designs of polynomial size so as to admit efficient learning algorithms. To this end, we first demonstrate how to simulate one-step diffusion under the CLT model, which can be used repeatedly to simulate an arbitrary number of diffusion steps.
For the one-step diffusion, the simulation is done through two groups of layers that are used in turn to simulate phase 1 and phase 2. As illustrated in Figure 2, the simulation starts by receiving the diffusion status after the last step, where for and , if and only if node is -active. After layers of transformations, it ends by outputting a vector , which corresponds exactly to the diffusion status after one diffusion step under the CLT model. The layer transformations all follow the canonical form:
(5) |
with being the weight matrix and being a collection of activation functions over the elements of . In the rest of this subsection, we will present our designs of the weight matrices and the activation functions that can fulfill the goal of simulation.
3.1.1 Phase 1
The single-cascade activation can be simulated by
(6) |
where for each and , we have
and for each and , we have
(7) |
For each and , the resulting value is equal to the influence summation of cascade at node if is activated by cascade , and zero otherwise.
3.1.2 Phase 2
In simulating phase 2, the general goal is to figure out which of the candidate cascades wins the competition. Intuitively, this can be implemented through a hierarchy of pairwise comparisons. One technical challenge is that a comparison made directly by linear functions tells only the difference between the summations, while we need to keep the summation itself for future comparisons; we observe that this problem can be solved by using two consecutive layers of piecewise linear functions. A more serious problem is that simply passing the largest influence summation to the following layers loses the information about the cascade identity (i.e., its index), making it impossible to identify the node status, which is required as the input for simulating the next diffusion step. We address this challenge by appending the cascade identity to the weight summation as its lowest bits. The number of shifted bits is determined by the system precision (Assumption 1). Doing so does not affect the comparison result while making it possible to retrieve the cascade identity by scaling and modulo operations. In particular, two groups of transformations are employed to achieve the pairwise comparison and status recovery.
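The following toy sketch illustrates this encode-compare-decode trick in plain Python under Assumption 1. The decimal shifting scheme and the function names are illustrative choices, not the exact layer construction.

```python
def encode(summation, cascade_idx, precision, num_cascades):
    # Shift the influence summation left and write the cascade index into the
    # freed low-order positions. Note: with this encoding, ties in the summation
    # favor the larger index; writing (num_cascades - 1 - cascade_idx) instead
    # would match the smaller-index tie-breaking of the model.
    base = 10 ** precision          # precision = bound c from Assumption 1
    return round(summation * base) * num_cascades + cascade_idx

def pairwise_max(a, b):
    # Keep the larger encoded value using only a ReLU and an addition,
    # mirroring the two-layer comparison: ReLU(a - b) + b == max(a, b).
    return max(a - b, 0.0) + b

def decode(encoded, precision, num_cascades):
    # Recover (summation, cascade index) via scaling and modulo operations.
    base = 10 ** precision
    cascade_idx = int(encoded) % num_cascades
    summation = (encoded - cascade_idx) / (num_cascades * base)
    return summation, cascade_idx
```

For instance, with precision 2 and four cascades, pairwise_max(encode(0.37, 2, 2, 4), encode(0.41, 1, 2, 4)) returns the encoding of cascade 1, and decode recovers (0.41, 1): the comparison result is unaffected by the appended identity because the gap between distinct summations is always larger than the identity offset.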
Pairwise comparison. With the input from phase 1, we encode the cascade identity through
(8) |
where for each and , we have
(9) |
and for each and , we have
(10) |
The design of ensures that sufficient bits are available at the low side, and the activation function then writes the cascade identity into the empty bits (Remark 1 in Figure 2).
Using the information in , for a node , the largest influence summation can be determined by the sub-array
through pairwise comparisons. For the comparison between two cascades and at node (), as illustrated in Remark 2 in Figure 2, the difference
is first fed into a ReLU function, and the result is then added to . We can verify that this returns exactly
Therefore, two layers are used to eliminate half of the candidate cascades, resulting in layers in total. The output of this part is , in which we have
for each . Notably, the lowest bits in store the cascade index, which can be recovered by
(11) |
where is the identity matrix and we have
(12) |
Finally, the cascade index is converted to binary indicators through two layers. The first layer extracts the index of the winning cascade and can be achieved by
(13) |
where for each and , we have
(14) |
and
(15) |
The second layer is used to convert the extracted index into the binary status indicators and can be implemented through
(16) |
where for each and , we have
(17) |
and
(18) |
is exactly the one-step diffusion result after . Repeating such one-step simulations for steps, the CLT model can be explicitly simulated. One important property is that the entire neural network is composed of piecewise-polynomial activation functions, which will be used in deriving its sample complexity. By scrutiny, the following result summarizes the network structure.
Lemma 1.
A CLT model can be simulated by a feed-forward neural network composed of adjustable weights, layers, and piecewise linear computation units each with pieces.
3.2 Efficient ERM
Taking the weights and the thresholds as parameters, the neural networks designed in the last subsection form a hypothesis space denoted by . Lemma 1 implies that for any sample set , there always exists a perfect hypothesis in . In what follows, we show that such an empirical risk minimization solution can be computed efficiently. It suffices to find parameters that ensure the output of each activation function coincides with the diffusion result. Since the activation functions are all piecewise linear, the search can be carried out by a linear program with a polynomial number of constraints. Formally, we have the following statement.
Lemma 2.
For a CLT model and each sample set , there exists a hypothesis in with zero training error, and it can be computed in time polynomial in , and by solving the following linear program:
(19) |
Subject to:
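Since the constraints of the linear program are not reproduced above, the following sketch only illustrates the general idea for a single node and a single cascade under full observation, consistent with the per-node decoupling used in the appendix: find edge weights and a threshold that separate the observed activation and non-activation configurations, with nonnegative slack variables absorbing violations and a small margin standing in for strict inequalities. The decomposition, the margin value, and the function names are our assumptions, not the exact program.

```python
import numpy as np
from scipy.optimize import linprog

def fit_node_threshold(active_rows, inactive_rows, margin=1e-3):
    # active_rows[i]  : binary vector of which candidate in-neighbors were
    #                   k-active right before the node became k-active
    # inactive_rows[j]: binary vector of a configuration where it did not activate
    A_pos = np.atleast_2d(np.asarray(active_rows, dtype=float))
    A_neg = np.atleast_2d(np.asarray(inactive_rows, dtype=float))
    n = A_pos.shape[1]
    m_pos, m_neg = A_pos.shape[0], A_neg.shape[0]
    m = m_pos + m_neg
    # variables: [w_1 .. w_n, theta, xi_1 .. xi_m]; minimize total slack
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])
    # activation samples:      w . a >= theta - xi
    rows_pos = np.hstack([-A_pos, np.ones((m_pos, 1)), -np.eye(m)[:m_pos]])
    # non-activation samples:  w . a <= theta - margin + xi
    rows_neg = np.hstack([A_neg, -np.ones((m_neg, 1)), -np.eye(m)[m_pos:]])
    A_ub = np.vstack([rows_pos, rows_neg])
    b_ub = np.concatenate([np.zeros(m_pos), -margin * np.ones(m_neg)])
    bounds = [(0, 1)] * (n + 1) + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]   # learned weights and threshold
```

When the samples are generated by a CLT model, a zero-slack (hence zero training error) solution exists, which is the content of the lemma.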
3.3 Generalization Performance
The learnability of class is enabled by the analysis of its VC dimension.
Lemma 3.
The VC dimension of class is .
Proof sketch.
The proof is obtained by first showing that the layers for one-step simulation have a manageable VC dimension. Given the fact that repeating such one-step simulations does not increase the model capacity in terms of shattering points, the VC dimension of the entire neural network can be derived immediately. ∎
Together with Lemma 2, the following result summarizes our theoretical findings.
Theorem 1.
The influence function class is PAC learnable and the sample complexity is .
Proof sketch.
Following the empirical risk minimization strategy, we can always find a group of parameters that makes the empirical risk equal to zero, which guarantees the realizable case. Based on the fundamental theorem of statistical learning ? and the finite VC dimension of , we obtain that is PAC learnable with sample complexity . The sample complexity of then follows from a union bound over all nodes. ∎
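For reference, the realizable-case bound invoked above has the following standard form, where d denotes the VC dimension, ε the accuracy parameter, δ the confidence parameter, and C an absolute constant (these symbols are introduced here only for illustration):

\[
m(\epsilon, \delta) \;\le\; C \, \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}.
\]

Instantiating d with the VC dimension from Lemma 3 and replacing δ by its value divided by the number of nodes gives the union bound over all nodes mentioned in the proof sketch.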
Remark 1.
In the case of two competitive cascades, we obtain the same bound on the VC dimension as that of the influence function under the LT model with a single cascade narasimhan2015learnability. The structure of the corresponding neural network is slightly different from our prescribed one: for each diffusion step, it contains three layers of nodes, as illustrated in Figure 3.
The architecture of phase 1 is the same as in the prescribed neural network. For phase 2,
(20) |
where for each and , we have
(21) |
and is a linear threshold function.
is exactly the graph status after phase 2. Therefore, one step of the CLT model with two competitive cascades can also be implemented by a neural network with layers, adjustable parameters, and computation units. Every output node is a linear threshold unit, and the computational units are piecewise polynomial with pieces and degree in each piece. Based on Theorem 2, with a fixed number of pieces and a fixed degree, we can obtain for node for one step. With shared parameters, when , following the same analysis as in Lemma 3, we have .
4 Experiment
In this section, we present empirical studies on a collection of datasets, with the goal of evaluating our algorithms against different baselines under two sample types. We aim to explore how many samples the proposed learning algorithms need to achieve reasonable performance.
4.1 Data
Synthetic data
We use two groups of synthetic graph data: a Kronecker graph with nodes and edges and a power-law graph with nodes and edges. Following the CLT model, for each cascade, we set the weight on each edge as , where denotes the in-degree of node . For each node , the threshold for each cascade is generated uniformly at random from .
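As an illustration, the following sketch generates such parameters with networkx; the specific graph generator, the graph size, and the random seed are placeholders, and the threshold range is assumed to be [0, 1].

```python
import numpy as np
import networkx as nx

def make_clt_parameters(graph, num_cascades, rng=None):
    # For every cascade, each edge (u, v) gets weight 1 / in-degree(v), and each
    # node/cascade threshold is drawn uniformly at random, as described above.
    rng = np.random.default_rng(rng)
    weights, thresholds = [], []
    for _ in range(num_cascades):
        weights.append({(u, v): 1.0 / graph.in_degree(v)
                        for u, v in graph.edges()})
        thresholds.append({v: rng.uniform(0.0, 1.0) for v in graph.nodes()})
    return weights, thresholds

# Example: a directed power-law graph (the generator and size are illustrative).
G = nx.DiGraph(nx.scale_free_graph(1000))
G.remove_edges_from(list(nx.selfloop_edges(G)))
weights, thresholds = make_clt_parameters(G, num_cascades=2, rng=0)
```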
Real data
We apply our approaches to the hashtag cascade data in the Gab network, collected from August 2016 to October 2018. The original network includes more than users and edges, and different hashtags are recorded. Each hashtag can be seen as a cascade. To obtain cascades with competitive relationships, we choose hashtags with a relatively high overlap of forwarding users during a fixed period. In our experiments, we select the hashtags 'trump' and 'obama', since more than 30 percent of users forwarded both of them from 2017 to 2018. We then extract the corresponding social graph based on the users involved in these selected hashtags. The seed sets are the users who forwarded the cascades at the beginning of our collection. We set the time interval between steps as one day.
4.2 Experiment settings
Sample generating and CLT model
For each graph, we let cascades propagate simultaneously. In each sample, the total number of seeds is set to , and each seed set is selected uniformly at random over all subsets of . For each sample, the graph status during each diffusion step and the number of diffusion steps are recorded. For each graph, we collect samples.
Evaluation metrics
We use four metrics to evaluate each learning approach: precision, recall, F1 score, and accuracy.
The experiments contain two groups.
The training samples are randomly selected in different sizes: . The test data size is fixed at .
-
•
Proposed method: we use the linear programming approach.
-
•
Baseline methods: we use supervised multi-class classification algorithms. For each node, we train a model to predict the node status after one time step . For all training samples, the training pair is defined as . In testing, we vary the number of diffusion steps to evaluate the performance of the learned models. A sketch of this per-node setup is given after this list.
-
–
Logistic regression model: the probability of a node being in status is . The predicted status is the cascade with the maximum probability.
-
–
SVM model: this method implements the "one-versus-rest" multi-class strategy.
-
–
MLP model: with ReLU activation functions throughout, we set the hidden layer size to 100 and the learning rate to .
-
–
-
•
We also include a random-guess baseline.
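For concreteness, the per-node baseline setup referred to above can be sketched with scikit-learn as follows. The hyperparameter values shown (e.g., the iteration cap and the learning rate) are placeholders rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

def train_per_node_baselines(X_prev, y_next, model="lr"):
    # X_prev: (num_samples, num_features) flattened graph status before a step
    # y_next: (num_samples, num_nodes) per-node status labels after the step
    makers = {
        "lr":  lambda: LogisticRegression(max_iter=1000),
        "svm": lambda: LinearSVC(),  # one-versus-rest multi-class strategy
        "mlp": lambda: MLPClassifier(hidden_layer_sizes=(100,),
                                     learning_rate_init=1e-3),
    }
    models = []
    for node in range(y_next.shape[1]):
        clf = makers[model]()          # one multi-class classifier per node
        clf.fit(X_prev, y_next[:, node])
        models.append(clf)
    return models

def predict_one_step(models, X_prev):
    # Predict every node's status after one step; apply repeatedly (feeding the
    # prediction back in) to roll out several diffusion steps at test time.
    return np.column_stack([clf.predict(X_prev) for clf in models])
```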
4.3 Observation
In this section we list our main results in three tables. Additional experimental results on the power-law graph are in ?.
Overall observations
We compare the prediction performance of LP with that of other standard supervised machine learning algorithms. The main results on the Kronecker graph and the Gab cascade data are shown in Table 1, where each cell shows the corresponding metric together with its standard deviation. The results show that LP outperforms the other methods as more training samples are provided. With sample size , LP already achieves an accuracy of .
train_size | 100 | 500 | |||||||
---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.993/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 |
LR | 1 | 0.499/0.001 | 0.676/0.002 | 0.452/0.001 | 0.709/0.002 | 0.509/0.001 | 0.732/0.001 | 0.456/0.001 | 0.719/0.004 |
2 | 0.474/0.001 | 0.752/0.012 | 0.425/0.002 | 0.705/0.004 | 0.473/0.002 | 0.811/0.004 | 0.421/0.002 | 0.708/0.003 | |
3 | 0.460/0.003 | 0.802/0.010 | 0.412/0.002 | 0.705/0.006 | 0.454/0.001 | 0.852/0.003 | 0.406/0.001 | 0.703/0.005 | |
4 | 0.452/0.003 | 0.827/0.005 | 0.405/0.002 | 0.701/0.004 | 0.449/0.001 | 0.867/0.004 | 0.401/0.001 | 0.703/0.005 | |
5 | 0.451/0.003 | 0.832/0.006 | 0.404/0.002 | 0.701/0.004 | 0.446/0.002 | 0.876/0.003 | 0.400/0.001 | 0.703/0.003 | |
SVM | 1 | 0.556/0.002 | 0.716/0.002 | 0.502/0.002 | 0.727/0.003 | 0.583/0.001 | 0.797/0.002 | 0.517/0.001 | 0.747/0.003 |
2 | 0.496/0.001 | 0.731/0.003 | 0.446/0.001 | 0.705/0.004 | 0.504/0.002 | 0.805/0.004 | 0.447/0.001 | 0.713/0.006 | |
3 | 0.480/0.003 | 0.770/0.009 | 0.430/0.002 | 0.702/0.005 | 0.483/0.002 | 0.834/0.001 | 0.429/0.001 | 0.706/0.004 | |
4 | 0.470/0.004 | 0.798/0.009 | 0.421/0.003 | 0.697/0.003 | 0.473/0.001 | 0.855/0.002 | 0.421/0.001 | 0.701/0.004 | |
5 | 0.465/0.003 | 0.818/0.006 | 0.417/0.003 | 0.698/0.006 | 0.468/0.003 | 0.863/0.002 | 0.417/0.001 | 0.700/0.006 | |
MLP | 1 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 |
2 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.934/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.001 | |
3 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | |
4 | 0.909/0.001 | 0.975/0.001 | 0.861/0.002 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | |
5 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.000 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | |
random guess | - | 0.209/0.002 | 0.250/0.002 | 0.250/0.004 | 0.250/0.002 | 0.210/0.001 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 |
train_size | 100 | 500 | |||||||
---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.994/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 |
LR | 1 | 0.207/0.002 | 0.292/0.009 | 0.259/0.000 | 0.618/0.012 | 0.207/0.002 | 0.288/0.008 | 0.260/0.001 | 0.619/0.003 |
2 | 0.208/0.005 | 0.283/0.005 | 0.261/0.002 | 0.609/0.002 | 0.205/0.001 | 0.293/0.004 | 0.259/0.001 | 0.609/0.010 | |
3 | 0.209/0.001 | 0.293/0.011 | 0.261/0.001 | 0.610/0.006 | 0.206/0.002 | 0.291/0.016 | 0.260/0.001 | 0.613/0.002 | |
4 | 0.205/0.004 | 0.294/0.009 | 0.259/0.001 | 0.611/0.012 | 0.206/0.004 | 0.293/0.003 | 0.260/0.002 | 0.609/0.009 | |
5 | 0.206/0.000 | 0.284/0.009 | 0.259/0.000 | 0.608/0.001 | 0.206/0.000 | 0.285/0.007 | 0.259/0.001 | 0.611/0.006 | |
SVM | 1 | 0.230/0.002 | 0.281/0.017 | 0.274/0.001 | 0.622/0.002 | 0.228/0.002 | 0.257/0.002 | 0.274/0.001 | 0.626/0.007 |
2 | 0.231/0.002 | 0.286/0.005 | 0.275/0.001 | 0.618/0.001 | 0.229/0.001 | 0.258/0.002 | 0.275/0.001 | 0.624/0.001 | |
3 | 0.226/0.002 | 0.272/0.003 | 0.273/0.001 | 0.617/0.006 | 0.229/0.003 | 0.263/0.001 | 0.275/0.000 | 0.620/0.012 | |
4 | 0.230/0.000 | 0.278/0.004 | 0.275/0.001 | 0.617/0.001 | 0.231/0.000 | 0.268/0.004 | 0.277/0.001 | 0.623/0.005 | |
5 | 0.231/0.001 | 0.280/0.001 | 0.276/0.002 | 0.620/0.012 | 0.230/0.001 | 0.265/0.004 | 0.277/0.001 | 0.619/0.003 | |
MLP | 1 | 0.230/0.001 | 0.272/0.006 | 0.274/0.001 | 0.621/0.000 | 0.231/0.004 | 0.268/0.010 | 0.277/0.002 | 0.621/0.003 |
2 | 0.233/0.002 | 0.286/0.002 | 0.275/0.001 | 0.628/0.012 | 0.231/0.004 | 0.269/0.001 | 0.276/0.002 | 0.629/0.003 | |
3 | 0.230/0.003 | 0.272/0.012 | 0.274/0.002 | 0.622/0.007 | 0.230/0.002 | 0.273/0.000 | 0.276/0.002 | 0.623/0.004 | |
4 | 0.227/0.005 | 0.274/0.004 | 0.272/0.002 | 0.617/0.002 | 0.229/0.002 | 0.284/0.016 | 0.274/0.001 | 0.621/0.001 | |
5 | 0.230/0.001 | 0.280/0.013 | 0.274/0.000 | 0.619/0.004 | 0.228/0.001 | 0.261/0.002 | 0.275/0.001 | 0.620/0.004 | |
random guess | - | 0.214/0.002 | 0.251/0.002 | 0.251/0.003 | 0.251/0.001 | 0.213/0.000 | 0.250/0.000 | 0.250/0.001 | 0.250/0.000 |
Impact of the number of cascades
5 Conclusion
Appendix A Proofs
A.1 Proof of Lemma 1
We design a neural network to simulate the diffusion process under the CLT model. For each diffusion step, we simulate phase 1 and phase 2 in turn, as detailed below.
Phase 1
Phase 2
A.2 Proof of Lemma 2
The linear programming approach shows that the ERM solution can be computed efficiently from the sample data. Given the sample set , the observed activation status in sample after time step during the diffusion process is given by . The loss of an influence function for the sample is given by
(22) |
In the prescribed neural network with one diffusion step , the outputs after phase 1 and after phase 2 can each be seen as a binary vector of size . We first focus on each output node .
Phase 1
Given the graph status after step , let denote the computation of the output node after phase 1, and , where denotes the parameters related to node . Then, the local prediction error can be defined as
(23) |
During the phase 1 diffusion process in each step, the propagation of each cascade can be seen as cascades diffusing independently according to the linear threshold model (and the neural network during phase 1 is a linear threshold network). Therefore, there always exist parameters such that after phase 1.
Phase 2
In phase 2, the comparison process involves no unknown parameters. Let denote the computation of the output node after phase 1. Then, the local prediction error can be defined as
(24) |
Formulating the LP
Our learning problem can thus be decoupled into a group of linear programs, one for each time step in each sample. The optimization is over the variables and a group of slack variables for and , which are introduced to help formulate the objective function. Therefore,
(25) |
Subject to:
where
A.3 Realizability of the hypothesis space
The realizability assumption on the hypothesis class is that there is a such that, given a sample set , . Given the influence function class , the samples are generated from the distribution of and a target function . In our setting, we care about the function . Therefore, given our designed neural network model with a finite VC dimension and following the ERM strategy, we can obtain a probably approximately correct model, which is also called the restricted model. Thus, there always exists a function consistent with the samples bartlett1999hardness. Therefore, supposing that are the underlying parameters of the CLT model, the following equation is always satisfied:
(26) |
This completes the proof.
A.4 Proof of Lemma 3
The following theorem gives the VC dimension of the class of neural networks with all piecewise-polynomial units.
Theorem 2.
? Suppose is a feed-forward network with a total of weights and computational units, in which the output unit is a linear threshold unit and every other computation unit has a piecewise-polynomial activation function with pieces and degree no more than . Suppose in addition that the computation units in the network are arranged in layers, so that each unit has connections only from units in earlier layers. Then if is the class of functions computed by ,
(27) |
for a fixed and ,
Note that the influence function class is the set of all possible neural networks we designed in Section 3. We first consider the neural networks with a single output . We use to denote the computation at the output node after time step , where . Based on the architecture of the neural network, for any , the attributes of the corresponding neural network can be summarized in Table 3.
parameters | symbol | diffusion step |
---|---|---|
layers | ||
maximum pieces of all units | ||
maximum degree of all pieces | ||
computational units | ||
adjustable parameters |
Therefore, we can obtain the upper bound of the VC dimension when with one output , denoted by . Following this idea, if we ignored the parameter sharing across time steps, the total number of unknown parameters in the global neural network would be , and according to Theorem 2 we would obtain a VC dimension bound of . However, all the parameters we want to learn are shared during the global diffusion process in , which leads to a lower ability to shatter a subset of points for that neural network. Therefore, we can obtain a tighter bound on the VC dimension of the influence function class when .
To see this, we first calculate , which is . Now we prove that when . Consider a set of points shattered by , where . From the definition of a shattered set, we have:
(28) |
where
From the above equation, we have
(29) |
We can see that the must be different so that all patterns of the nodes can be realized. This implies that the set is shattered by in time step . Thus, for any set of points of a given size shattered by when , there exists a set of points of the same size shattered by in time step . Therefore:
(30) |
Hence, . The result is summarized in Lemma 3.
A.5 Proof of Theorem 1
Theorem 3.
? Let be a hypothesis class of functions from a domain to and let the loss function be the loss. Then, the following are equivalent:
-
•
has the uniform convergence property.
-
•
Any ERM rule is a successful agnostic PAC learner for .
-
•
is agnostic PAC learnable.
-
•
is PAC learnable.
-
•
Any ERM rule is a successful PAC learner for
-
•
has a finite VC dimension
Theorem 4.
Let be a hypothesis class of functions from a domain to and let the loss function be the loss. Assume that . Then, there are absolute constants such that is PAC learnable with sample complexity:
(31) |
Therefore, based on the Fundamental Theorem of Statistical Learning, the influence function is PAC learnable with sample complexity for any such that with confidence , the generalization error , where
(32) |
By taking a union bound over all nodes, we can get .
Appendix B Additional materials for experiments
train_size | 50 | 100 | 500 | 1000 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.991/0.001 | 0.993/0.000 | 0.989/0.001 | 0.902/0.001 | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.993/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 | ||||
LR | 1 | 0.488/0.002 | 0.622/0.002 | 0.448/0.002 | 0.695/0.002 | 0.499/0.001 | 0.676/0.002 | 0.452/0.001 | 0.709/0.002 | 0.509/0.001 | 0.732/0.001 | 0.456/0.001 | 0.719/0.004 | 0.508/0.001 | 0.732/0.002 | 0.455/0.001 | 0.718/0.005 |
2 | 0.471/0.002 | 0.668/0.011 | 0.427/0.002 | 0.697/0.004 | 0.474/0.001 | 0.752/0.012 | 0.425/0.002 | 0.705/0.004 | 0.473/0.002 | 0.811/0.004 | 0.421/0.002 | 0.708/0.003 | 0.472/0.001 | 0.816/0.004 | 0.421/0.001 | 0.708/0.005 | |
3 | 0.460/0.001 | 0.713/0.022 | 0.416/0.001 | 0.697/0.004 | 0.460/0.003 | 0.802/0.010 | 0.412/0.002 | 0.705/0.006 | 0.454/0.001 | 0.852/0.003 | 0.406/0.001 | 0.703/0.005 | 0.454/0.001 | 0.853/0.003 | 0.406/0.001 | 0.702/0.002 | |
4 | 0.458/0.004 | 0.743/0.019 | 0.413/0.004 | 0.698/0.006 | 0.452/0.003 | 0.827/0.005 | 0.405/0.002 | 0.701/0.004 | 0.449/0.001 | 0.867/0.004 | 0.401/0.001 | 0.703/0.005 | 0.448/0.001 | 0.874/0.003 | 0.401/0.001 | 0.700/0.006 | |
5 | 0.454/0.005 | 0.772/0.017 | 0.408/0.005 | 0.699/0.004 | 0.451/0.003 | 0.832/0.006 | 0.404/0.002 | 0.701/0.004 | 0.446/0.002 | 0.876/0.003 | 0.400/0.001 | 0.703/0.003 | 0.446/0.002 | 0.879/0.002 | 0.399/0.001 | 0.703/0.005 | |
SVM | 1 | 0.537/0.001 | 0.663/0.008 | 0.491/0.001 | 0.716/0.004 | 0.556/0.002 | 0.716/0.002 | 0.502/0.002 | 0.727/0.003 | 0.583/0.001 | 0.797/0.002 | 0.517/0.001 | 0.747/0.003 | 0.904/0.001 | 0.974/0.000 | 0.854/0.001 | 0.932/0.001 |
2 | 0.488/0.001 | 0.674/0.005 | 0.443/0.001 | 0.695/0.002 | 0.496/0.001 | 0.731/0.003 | 0.446/0.001 | 0.705/0.004 | 0.504/0.002 | 0.805/0.004 | 0.447/0.001 | 0.713/0.006 | 0.904/0.001 | 0.975/0.000 | 0.854/0.001 | 0.933/0.001 | |
3 | 0.476/0.003 | 0.705/0.011 | 0.431/0.002 | 0.693/0.002 | 0.480/0.003 | 0.770/0.009 | 0.430/0.002 | 0.702/0.005 | 0.483/0.002 | 0.834/0.001 | 0.429/0.001 | 0.706/0.004 | 0.904/0.000 | 0.974/0.000 | 0.854/0.001 | 0.931/0.000 | |
4 | 0.471/0.004 | 0.734/0.017 | 0.425/0.004 | 0.696/0.002 | 0.470/0.004 | 0.798/0.009 | 0.421/0.003 | 0.697/0.003 | 0.473/0.001 | 0.855/0.002 | 0.421/0.001 | 0.701/0.004 | 0.905/0.001 | 0.974/0.000 | 0.855/0.001 | 0.932/0.000 | |
5 | 0.465/0.004 | 0.757/0.018 | 0.420/0.004 | 0.692/0.005 | 0.465/0.003 | 0.818/0.006 | 0.417/0.003 | 0.698/0.006 | 0.468/0.003 | 0.863/0.002 | 0.417/0.001 | 0.700/0.006 | 0.904/0.001 | 0.974/0.001 | 0.854/0.002 | 0.932/0.001 | |
MLP | 1 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.910/0.001 | 0.975/0.000 | 0.862/0.001 | 0.935/0.000 |
2 | 0.909/0.001 | 0.975/0.001 | 0.862/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.934/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.000 | |
3 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | |
4 | 0.909/0.000 | 0.975/0.000 | 0.860/0.000 | 0.935/0.001 | 0.909/0.001 | 0.975/0.001 | 0.861/0.002 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.000 | 0.975/0.000 | 0.861/0.000 | 0.935/0.001 | |
5 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.000 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.861/0.001 | 0.935/0.000 | |
random guess | - | 0.208/0.002 | 0.250/0.002 | 0.250/0.003 | 0.250/0.002 | 0.209/0.002 | 0.250/0.002 | 0.250/0.004 | 0.250/0.002 | 0.210/0.001 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 | 0.208/0.000 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 |
train size | 50 | 100 | 500 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
algorithm | number of cascades | f1_score | precision | recall | accuracy | f1_score | precision | recall | accuracy | f1_score | precision | recall | accuracy |
LP | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
LR | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
SVM | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
NN | 2 | ||||||||||||
4 | |||||||||||||
8 |
parameters:
-
•
, for
-
•
, -active for ; for .
-
•
sample:
B.0.1 Phase 1
-
•
size:
-
•
(33)
(34) -
•
size:
-
•
:
(35)
(36)
B.0.2 Phase 2
Pairwise comparison
-
•
Add cascade index
-
•
(37) -
•
size:
-
•
:
(38) -
•
Begin comparison: the composition of these comparison layers can be seen as identical structures (each containing nodes) from the longitudinal view. Within each structure, the idea is to compare the values of every two adjacent nodes and pass the larger value forward until only one node is left, so the number of nodes is halved every two layers. Therefore, layers are needed to complete the pairwise comparison. Furthermore, during each pairwise comparison (every two layers among the layers), the architecture is the same.
-
•
(39) -
•
size:
-
•
: ReLu
(40) -
•
(41) -
•
size:
-
•
: Identity
(42) -
•
size:
-
•
: Relu
-
•
size:
-
•
: Identity
-
•
…
-
•
size:
-
•
: Identity
Status recovering
-
•
(43) -
•
size:
-
•
:
(44) -
•
(45) -
•
size:
-
•
:
(46)
layer
takes as the input. For further illustration, we reshape the matrix into a vector . Let denote the input layer. The matrix indicates the weights associated with each edge in the graph. We use to index the nodes in the input layer and the layer, where and . Furthermore, to keep the activated nodes active during the following diffusion process, an additional link with a fixed weight is added for each node between these two layers. The weight matrix is defined as follows.
(47) |
The activation function is designed to execute the linear threshold function with the unknown thresholds. In order to simulate the phase diffusion process, it is crucial to record the weight summation of each cascade, which is exactly . Therefore, the activation function is defined as follows.
(48) |
Figure 4 gives an example of the structure of this group with cascades. Clearly, the values of the nodes in the layer indicate the graph status after phase .
Lemma 4.
For , if , then node in graph is -active after phase ; otherwise inactive.
layer
This layer is designed to preserve the pre-activated cascade index . An additional binary node is added for each node in the layer. The matrix is fixed and defined as follows.
(49) |
The activation function is the identity function and defined as follows.
(50) |
B.0.3 Comparison layers
This group simulates the phase 2 diffusion process. Since the comparison among the weight summations of the cascades happens inside every node, the composition of these comparison layers can be seen as identical structures from the longitudinal view. An example of one user with cascades is given in Figure 5. For every nodes, we construct a comparison structure. The idea is to compare the values of every two adjacent nodes and pass the larger value forward until only one node is left, so the number of nodes is halved every layers. Therefore, layers are needed to complete the comparison process.
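A compact way to view these comparison layers is the following NumPy sketch: starting from the encoded values of one node, each round halves the number of candidates using the ReLU-based maximum, so a power-of-two number of cascades is reduced to a single winner in logarithmically many rounds. It illustrates the layer structure only, not the exact weight matrices.

```python
import numpy as np

def tournament_max(values):
    # values: encoded summations of one node, length assumed a power of two
    vals = np.asarray(values, dtype=float)
    while vals.size > 1:
        a, b = vals[0::2], vals[1::2]          # adjacent pairs
        vals = np.maximum(a - b, 0.0) + b      # ReLU(a - b) + b == max(a, b)
    return vals[0]
```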
In the whole set of comparison layers for every nodes, the weight matrix is identical and fixed. Since the number of nodes keeps being halved every layers, for simplicity we set the input size to , for . If , is defined in Equation 51; otherwise, is defined in Equation 52.
(51) |
(52) |
ReLU and identity functions are used in turn within this group. When , we use the ReLU function; otherwise, we use the identity function. For , the activation function is defined as follows.
(53) |
Returning to the whole structure of these layers, by constructing a comparison structure for every nodes, the output is a vector of size . Each node in this layer can be seen as a node in the network, and its value indicates the activation status after phase 2 along with the corresponding weight summation.
B.0.4 Recover layers
This group contains layers and is designed to recover the format of the output. Figure 6 shows an example of the structure of this group.
layer
This layer is designed to extract the cascade index from the single digit of . We use to index each node. The weight matrix is defined as follows.
(54) |
We use a step function (a piecewise-constant function, i.e., a polynomial of degree zero in each piece) with pieces to return the final status of each node in the graph; it is defined as follows.
(55) |
layer
This layer is used to convert the format of the output to be the same as that of the layer, and the output is . The weight matrix is defined as follows.
(56) |
For each node in the output layer, we introduce a fixed threshold . The activation function is defined as follows.
(57) |
Therefore, the output of is the graph status after phase 2.
Lemma 5.