Learnability of Competitive Threshold Models
Abstract
Modeling the spread of social contagions is central to various applications in social computing. In this paper, we study the learnability of the competitive threshold model from a theoretical perspective. We demonstrate how competitive threshold models can be seamlessly simulated by artificial neural networks with finite VC dimension, which enables analytical sample complexity and generalization bounds. Based on the proposed hypothesis space, we design efficient algorithms under the empirical risk minimization scheme. The theoretical insights are finally translated into practical and explainable modeling methods, whose effectiveness is verified through a sanity check over a few synthetic and real datasets. The experimental results show that our method achieves decent performance without requiring excessive data points, outperforming off-the-shelf methods.
1 Introduction
Social contagion phenomena, such as the propagation of news and behavior patterns or the adoption of opinions and new technologies, have attracted substantial research interest from various fields such as sociology, psychology, epidemiology, and network science cozzo2013contact; Goldstone2005ComputationalMO; camacho2020four. Formal methods for modeling social contagion are crucial to many applications, for example, recommendation and advertising in viral marketing qiu2018deepinf and misinformation detection tong2020stratlearner. Given the operational diffusion models, a fundamental problem is to what extent we can learn such models from data, i.e., the learnability of the models. In this paper, we study the classic threshold models, in which the core hypothesis is that the total influence from active friends determines the contagion according to a threshold. We seek to derive their PAC learnability and sample complexity, by which efficient learning algorithms become possible.
Motivation.
Learning contagion models from data has been a central topic in social computing. Some existing works focus on estimating the parameters of diffusion models by local estimation techniques in order to predict the influence results goyal2010learning; du2012learning; de2014learning. Most of these works assume fully observed cascade data. In real life, data sets with local diffusion information are not always available, and therefore we also consider data sets with incomplete observations; moreover, our learnability results make no assumptions on the data type. Another line of work develops various graph neural network models to estimate the influence of each node using deep learning techniques li2017deepcas; qiu2018deepinf; leung2019personalized. Although these methods have shown significant improvements in influence estimation performance, they do not come with theoretical guarantees. In our work, we aim to establish the analytical sample complexity of influence learning.
Recently, some works have focused on learning the influence function of different diffusion models from training data, i.e., the mapping from the seed set to the diffusion results. The works of du2012learning; he2016learning learn an approximate influence function by parameterizing a group of random reachability functions. Our approach instead learns the influence function directly from its hypothesis space without any such approximation. The most related works propose proper PAC learning algorithms for the influence function narasimhan2015learnability; adiga2019pac, but they assume that only one cascade diffuses in the network. However, it is common that multiple competitive cascades disseminate simultaneously, for example, in network marketing campaigns qiu2018deepinf. From this standpoint, we take a step towards the influence estimation problem without limiting the number of cascades.
Contribution.
Our contribution can be summarized as follows.
-
•
An influence function class under the competitive threshold model with finite VC dimension: we simulate the diffusion process as information propagation through a neural network whose units are all piecewise polynomial.
-
•
Proper PAC learning algorithms: we propose learning algorithms for two data types following the empirical risk minimization strategy. In particular, we develop a polynomial-time approach via linear programming under full observation.
-
•
Superior performance: experimental results on different datasets demonstrate better prediction performance for our proposed approaches.
2 Preliminaries
We first describe the diffusion model and then present the learning settings.
2.1 Competitive linear threshold model
We follow the standard competitive linear threshold (CLT) model ?, which has been widely adopted in the formal study of social contagions ?. A social network is represented as a directed graph , where is the set of nodes and is the set of edges, with and being the numbers of nodes and edges, respectively. For each node , we denote by the set of its in-neighbors.
We consider the scenario where there are information cascades, each of which starts to spread from its seed set for . Associated with each node and each cascade , there is a threshold ; associated with each edge and each cascade , there is a weight . Without loss of generality, we assume that the seed sets are disjoint and that the weights are normalized such that for each node and cascade . The nodes are initially inactive and become -active once activated by cascade . In particular, the diffusion process unfolds step by step as follows:
-
•
Step : For each cascade , nodes in become -active.
-
•
Step : Let be the set of -active nodes after step. There are two phases in step .
-
–
Phase 1: For an inactive node , let be the summation of the weights from its -active in-neighbors:
(1) The node is activated by cascade if and only if . After phase 1, it is possible that a node is activated by more than one cascade.
-
–
Phase 2: For each node , it will be finally activated by the cascade with the largest weight summation:
(2) For tie-breaking, cascades with smaller indices are preferred.
-
–
Clearly, there are at most diffusion steps, and without loss of generality, we may assume that there are always diffusion steps. An illustration is given in Figure 1. The following notations are useful for formal discussions.
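To make the two-phase dynamics above concrete, the following is a minimal Python sketch of one possible implementation. The data structures (dictionaries keyed by nodes and cascade indices) and the non-strict threshold comparison are illustrative assumptions rather than details fixed by the text.

```python
def clt_diffusion(neighbors_in, weights, thresholds, seeds, num_steps):
    """Minimal sketch of the two-phase CLT diffusion described above.

    neighbors_in[v]  : list of in-neighbors of node v
    weights[k][(u,v)]: weight of edge (u, v) for cascade k (normalized per node)
    thresholds[k][v] : threshold of node v for cascade k
    seeds[k]         : seed set of cascade k (seed sets are disjoint)
    Returns a dict mapping each activated node to its winning cascade index.
    """
    status = {v: k for k, seed_set in enumerate(seeds) for v in seed_set}
    for _ in range(num_steps):
        candidates = {}  # phase 1: inactive node -> {cascade: influence summation}
        for v in neighbors_in:
            if v in status:
                continue  # active nodes keep their cascade
            for k in range(len(seeds)):
                total = sum(weights[k][(u, v)] for u in neighbors_in[v]
                            if status.get(u) == k)
                if total >= thresholds[k][v]:   # assumed non-strict comparison
                    candidates.setdefault(v, {})[k] = total
        # phase 2: adopt the cascade with the largest summation;
        # ties are broken in favor of the smaller cascade index
        for v, sums in candidates.items():
            status[v] = min(sums, key=lambda k: (-sums[k], k))
    return status
```

Note that the candidate sets are collected before any status is updated, so each step uses only the statuses from the end of the previous step, matching the synchronous update described above.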
Definition 1.
We use a binary matrix to denote the initial status, where if and only if node is in the seed set of cascade . For a time step , the diffusion status is denoted by a binary tensor , where, for a cascade , if and only if node is -active after phase in time step .
We are interested in the learnability of the influence function of CLT models.
Definition 2.
Given a CLT model, the influence function maps from the initial status to the final status .
2.2 Learning settings
Assuming that the social graph is unknown to us, we seek to learn the influence function using a collection of pairs of initial statuses and the resulting statuses:
(3) |
where denotes the diffusion status after each phase in each diffusion step:
We use the zero-one loss to measure the difference between the prediction and the ground truth , which is defined as
For a distribution over the input and a CLT model generating the training data , we wish to leverage to learn a mapping such that the generalization loss can be minimized:
(4) |
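As a concrete reading of this objective, the sketch below computes one plausible instantiation of the zero-one loss (entry-wise disagreement averaged over the status tensor) and a Monte-Carlo estimate of the generalization loss on held-out samples; the exact aggregation used in the paper is not reproduced here and may differ.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # Average entry-wise disagreement between two binary status tensors of the
    # same shape; an exact-match indicator would be the stricter alternative.
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))

def empirical_generalization_loss(hypothesis, samples):
    # hypothesis: callable mapping an initial status X to a predicted status
    # samples   : iterable of (X, Y) pairs drawn from the input distribution
    return float(np.mean([zero_one_loss(hypothesis(X), Y) for X, Y in samples]))
```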
3 Learnability analysis
In this section, we present a PAC learnability analysis for the influence function. The overall idea is to design a realizable hypothesis space with a finite VC dimension, by which PAC learnability can be established. The following assumptions are made throughout the analysis.
Assumption 1.
We assume that the diffusion model is a decimal system of finite precision: for the involved parameters (i.e., the weights and thresholds), the number of decimal places is upper bounded by a constant .
Assumption 2.
The number of cascades is a power of two, i.e., for some constant .
3.1 Realizable hypothesis space
In order to obtain a realizable hypothesis space, we seek to explicitly simulate the diffusion function through neural networks. While universal approximation theory ensures that such neural networks exist, we wish for explicit designs of polynomial size so as to admit efficient learning algorithms. To this end, we first demonstrate how to simulate one-step diffusion under the CLT model, which can be used repeatedly to simulate an arbitrary number of diffusion steps.
For the one-step diffusion, the simulation is done through two groups of layers that are used in turn to simulate phase 1 and phase 2. As illustrated in Figure 2, the simulation starts by receiving the diffusion status after the last step, where for and , if and only if node is -active. After layers of transformations, it ends by outputting a vector , which corresponds exactly to the diffusion status after one diffusion step under the CLT model. The layer transformations all follow the canonical form:
(5) |
with being the weight matrix and being a collection of activation functions over the elements of . In the rest of this subsection, we will present our designs of the weight matrices and the activation functions that can fulfill the goal of simulation.
3.1.1 Phase 1
The single-cascade activation can be simulated by
(6) |
where for each and , we have
and for each and , we have
(7) |
For each and , the resulting value is equal to the influence summation of cascade at node if is activated by cascade , and zero otherwise.
3.1.2 Phase 2
In simulating phase 2, the general goal is to figure out which of the candidate cascades wins the competition. Intuitively, this can be implemented through a hierarchy of pairwise comparisons. One technical challenge is that a comparison made directly by linear functions tells only the difference between the summations, while we need to keep the summation itself for future comparisons; we observe that this problem can be solved by using two consecutive layers of piecewise linear functions. A more serious problem is that simply passing the largest influence summation to the following layers loses the information about the cascade identity (i.e., its index), making it impossible to identify the node status, which is required as the input for simulating the next diffusion step. We address this challenge by appending the cascade identity to the weight summation as its lowest bits. The number of shifted bits is determined by the system precision (Assumption 1). Doing so does not affect the comparison result while making it possible to retrieve the cascade identity by scaling and modulo operations. In particular, two groups of transformations are employed to achieve the pairwise comparison and status recovery.
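The following toy sketch illustrates this encode-compare-decode trick in plain Python under Assumption 1. The decimal shifting scheme and the function names are illustrative choices, not the exact layer construction.

```python
def encode(summation, cascade_idx, precision, num_cascades):
    # Shift the influence summation left and write the cascade index into the
    # freed low-order positions. Note: with this encoding, ties in the summation
    # favor the larger index; writing (num_cascades - 1 - cascade_idx) instead
    # would match the smaller-index tie-breaking of the model.
    base = 10 ** precision          # precision = bound c from Assumption 1
    return round(summation * base) * num_cascades + cascade_idx

def pairwise_max(a, b):
    # Keep the larger encoded value using only a ReLU and an addition,
    # mirroring the two-layer comparison: ReLU(a - b) + b == max(a, b).
    return max(a - b, 0.0) + b

def decode(encoded, precision, num_cascades):
    # Recover (summation, cascade index) via scaling and modulo operations.
    base = 10 ** precision
    cascade_idx = int(encoded) % num_cascades
    summation = (encoded - cascade_idx) / (num_cascades * base)
    return summation, cascade_idx
```

For instance, with precision 2 and four cascades, pairwise_max(encode(0.37, 2, 2, 4), encode(0.41, 1, 2, 4)) returns the encoding of cascade 1, and decode recovers (0.41, 1): the comparison result is unaffected by the appended identity because the gap between distinct summations is always larger than the identity offset.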
Pairwise comparison. With the input from phase 1, we encode the cascade identity through
(8) |
where for each and , we have
(9) |
and for each and , we have
(10) |
The design of ensures that sufficient bits are available at the low side, and the activation function then writes the cascade identity into the empty bits (Remark 1 in Figure 2).
Using the information in , for a node , the largest influence summation can be determined by the sub-array
through pairwise comparisons. For the comparison between two cascades and at node (), as illustrated in Remark 2 in Figure 2, the difference
is first fed into a ReLU function, and the result is then added to . We can verify that this returns exactly
Therefore, two layers are used to eliminate half of the candidate cascades, resulting in layers in total. The output of this part is , in which we have
for each . Notably, the lowest bits in store the cascade index, which can be recovered by
(11) |
where is the identity matrix and we have
(12) |
Finally, the cascade index is converted to binary indicators through two layers. The first layer extracts the index of the winning cascade and can be achieved by
(13) |
where for each and , we have
(14) |
and
(15) |
The second layer is used to convert the extracted index into the binary status indicators and can be implemented through
(16) |
where for each and , we have
(17) |
and
(18) |
is exactly the one-step diffusion result after . Repeating such one-step simulations for steps, the CLT model can be explicitly simulated. One important property is that the entire neural network is composed of piecewise-polynomial activation functions, which will be used in deriving its sample complexity. By scrutiny, the following result summarizes the network structure.
Lemma 1.
A CLT model can be simulated by a feed-forward neural network composed of adjustable weights, layers, and piecewise linear computation units each with pieces.
3.2 Efficient ERM
Taking the weights and the thresholds as parameters, the neural networks designed in the last subsection form a hypothesis space denoted by . Lemma 1 implies that for any sample set , there always exists a perfect hypothesis in . In what follows, we show that such an empirical risk minimization solution can be computed efficiently. It suffices to find parameters that ensure the output of each activation function coincides with the diffusion result. Since the activation functions are all piecewise linear, the search can be carried out by a linear program with a polynomial number of constraints. Formally, we have the following statement.
Lemma 2.
For a CLT model and each sample set , there exists a hypothesis in with zero training error, and it can be computed in time polynomial in , and by solving the following linear program:
(19) |
Subject to:
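Since the constraints of the linear program are not reproduced above, the following sketch only illustrates the general idea for a single node and a single cascade under full observation, consistent with the per-node decoupling used in the appendix: find edge weights and a threshold that separate the observed activation and non-activation configurations, with nonnegative slack variables absorbing violations and a small margin standing in for strict inequalities. The decomposition, the margin value, and the function names are our assumptions, not the exact program.

```python
import numpy as np
from scipy.optimize import linprog

def fit_node_threshold(active_rows, inactive_rows, margin=1e-3):
    # active_rows[i]  : binary vector of which candidate in-neighbors were
    #                   k-active right before the node became k-active
    # inactive_rows[j]: binary vector of a configuration where it did not activate
    A_pos = np.atleast_2d(np.asarray(active_rows, dtype=float))
    A_neg = np.atleast_2d(np.asarray(inactive_rows, dtype=float))
    n = A_pos.shape[1]
    m_pos, m_neg = A_pos.shape[0], A_neg.shape[0]
    m = m_pos + m_neg
    # variables: [w_1 .. w_n, theta, xi_1 .. xi_m]; minimize total slack
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])
    # activation samples:      w . a >= theta - xi
    rows_pos = np.hstack([-A_pos, np.ones((m_pos, 1)), -np.eye(m)[:m_pos]])
    # non-activation samples:  w . a <= theta - margin + xi
    rows_neg = np.hstack([A_neg, -np.ones((m_neg, 1)), -np.eye(m)[m_pos:]])
    A_ub = np.vstack([rows_pos, rows_neg])
    b_ub = np.concatenate([np.zeros(m_pos), -margin * np.ones(m_neg)])
    bounds = [(0, 1)] * (n + 1) + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]   # learned weights and threshold
```

When the samples are generated by a CLT model, a zero-slack (hence zero training error) solution exists, which is the content of the lemma.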
3.3 Generalization Performance
The learnability of class is enabled by the analysis of its VC dimension.
Lemma 3.
The VC dimension of class is .
Proof sketch.
The proof is obtained by first showing that the layers for one-step simulation have a manageable VC dimension. Given the fact that repeating such one-step simulations does not increase the model capacity in terms of shattering points, the VC dimension of the entire neural network can be derived immediately. ∎
Together with Lemma 2, the following result summarizes our theoretical findings.
Theorem 1.
The influence function class is PAC learnable and the sample complexity is .
Proof sketch.
Following the empirical risk minimization strategy, we can always find a group of parameters that makes the empirical risk equal to zero, which guarantees the realizable case. Based on the fundamental theorem of statistical learning ? and the finite VC dimension of , we obtain that is PAC learnable with sample complexity . The sample complexity of then follows from a union bound over all nodes. ∎
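For reference, the realizable-case bound invoked above has the following standard form, where d denotes the VC dimension, ε the accuracy parameter, δ the confidence parameter, and C an absolute constant (these symbols are introduced here only for illustration):

\[
m(\epsilon, \delta) \;\le\; C \, \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}.
\]

Instantiating d with the VC dimension from Lemma 3 and replacing δ by its value divided by the number of nodes gives the union bound over all nodes mentioned in the proof sketch.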
Remark 1.
In the case of two competitive cascades, we obtain the same bound on the VC dimension as that of the influence function under the LT model with a single cascade narasimhan2015learnability. The structure of the corresponding neural network is slightly different from our prescribed one: for each diffusion step, it contains three layers of nodes, as illustrated in Figure 3.
The architecture of phase 1 is the same as in the prescribed neural network. For phase 2,
(20) |
where for each and , we have
(21) |
and is a linear threshold function.
is exactly the graph status after phase 2. Therefore, one step of the CLT model with two competitive cascades can also be implemented by a neural network with layers, adjustable parameters, and computation units. Every output node is a linear threshold unit, and the computational units are piecewise polynomial with pieces and degree in each piece. Based on Theorem 2, with a fixed number of pieces and a fixed degree, we can obtain for node for one step. With shared parameters, when , following the same analysis as in Lemma 3, we have .
4 Experiment
In this section, we present empirical studies on a collection of datasets, with the goal of evaluating our algorithms against different baselines under two sample types. We aim to explore how many samples the proposed learning algorithms need to achieve reasonable performance.
4.1 Data
Synthetic data
We use two groups of synthetic graph data: a Kronecker graph with nodes and edges and a power-law graph with nodes and edges. Following the CLT model, for each cascade, we set the weight on each edge as , where denotes the in-degree of node . For each node , the threshold for each cascade is generated uniformly at random from .
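As an illustration, the following sketch generates such parameters with networkx; the specific graph generator, the graph size, and the random seed are placeholders, and the threshold range is assumed to be [0, 1].

```python
import numpy as np
import networkx as nx

def make_clt_parameters(graph, num_cascades, rng=None):
    # For every cascade, each edge (u, v) gets weight 1 / in-degree(v), and each
    # node/cascade threshold is drawn uniformly at random, as described above.
    rng = np.random.default_rng(rng)
    weights, thresholds = [], []
    for _ in range(num_cascades):
        weights.append({(u, v): 1.0 / graph.in_degree(v)
                        for u, v in graph.edges()})
        thresholds.append({v: rng.uniform(0.0, 1.0) for v in graph.nodes()})
    return weights, thresholds

# Example: a directed power-law graph (the generator and size are illustrative).
G = nx.DiGraph(nx.scale_free_graph(1000))
G.remove_edges_from(list(nx.selfloop_edges(G)))
weights, thresholds = make_clt_parameters(G, num_cascades=2, rng=0)
```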
Real data
We apply our approaches to the hashtag cascade data in the Gab network, collected from August 2016 to October 2018. The original network includes more than users and edges, and different hashtags are recorded. Each hashtag can be seen as a cascade. To obtain cascades with competitive relationships, we choose hashtags with a relatively high overlap of forwarding users during a fixed period. In our experiments, we select the hashtags 'trump' and 'obama', since more than 30 percent of users forwarded both of them from 2017 to 2018. We then extract the corresponding social graph based on the users involved in these selected hashtags. The seed sets are the users who forwarded the cascades at the beginning of our collection. We set the time interval between steps as one day.
4.2 Experiment settings
Sample generating and CLT model
For each graph, we let cascades propagate simultaneously. In each sample, the total number of seeds is set to , and each seed set is selected uniformly at random over all subsets of . For each sample, the graph status during each diffusion step and the number of diffusion steps are recorded. For each graph, we collect samples.
Evaluation metrics
We use four metrics to evaluate each learning approach: precision, recall, F1 score, and accuracy.
The experiments contain two groups.
The training samples are randomly selected in different sizes: . The test data size is fixed at .
-
•
Proposed method: we use the linear programming approach.
-
•
Baseline methods: we use supervised multi-class classification algorithms. For each node, we train a model to predict the node status after one time step . For all training samples, the training pair is defined as . In testing, we vary the number of diffusion steps to evaluate the performance of the learned models. A sketch of this per-node setup is given after this list.
-
–
Logistic regression model: the probability of a node being in status is . The predicted status is the cascade with the maximum probability.
-
–
SVM model: this method implements the "one-versus-rest" multi-class strategy.
-
–
MLP model: with ReLU activation functions throughout, we set the hidden layer size to 100 and the learning rate to .
-
–
-
•
We also include a random-guess baseline.
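For concreteness, the per-node baseline setup referred to above can be sketched with scikit-learn as follows. The hyperparameter values shown (e.g., the iteration cap and the learning rate) are placeholders rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

def train_per_node_baselines(X_prev, y_next, model="lr"):
    # X_prev: (num_samples, num_features) flattened graph status before a step
    # y_next: (num_samples, num_nodes) per-node status labels after the step
    makers = {
        "lr":  lambda: LogisticRegression(max_iter=1000),
        "svm": lambda: LinearSVC(),  # one-versus-rest multi-class strategy
        "mlp": lambda: MLPClassifier(hidden_layer_sizes=(100,),
                                     learning_rate_init=1e-3),
    }
    models = []
    for node in range(y_next.shape[1]):
        clf = makers[model]()          # one multi-class classifier per node
        clf.fit(X_prev, y_next[:, node])
        models.append(clf)
    return models

def predict_one_step(models, X_prev):
    # Predict every node's status after one step; apply repeatedly (feeding the
    # prediction back in) to roll out several diffusion steps at test time.
    return np.column_stack([clf.predict(X_prev) for clf in models])
```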
4.3 Observation
In this section we list our main results in three tables. Additional experimental results on the power-law graph are in ?.
Overall observations
We compare the prediction performance of LP with that of other standard supervised machine learning algorithms. The main results on the Kronecker graph and the Gab cascade data are shown in Table 1, where each cell shows the corresponding metric together with its standard deviation. The results show that LP outperforms the other methods as more training samples are provided. With sample size , LP already achieves an accuracy of .
train_size | 100 | 500 | |||||||
---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.993/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 |
LR | 1 | 0.499/0.001 | 0.676/0.002 | 0.452/0.001 | 0.709/0.002 | 0.509/0.001 | 0.732/0.001 | 0.456/0.001 | 0.719/0.004 |
2 | 0.474/0.001 | 0.752/0.012 | 0.425/0.002 | 0.705/0.004 | 0.473/0.002 | 0.811/0.004 | 0.421/0.002 | 0.708/0.003 | |
3 | 0.460/0.003 | 0.802/0.010 | 0.412/0.002 | 0.705/0.006 | 0.454/0.001 | 0.852/0.003 | 0.406/0.001 | 0.703/0.005 | |
4 | 0.452/0.003 | 0.827/0.005 | 0.405/0.002 | 0.701/0.004 | 0.449/0.001 | 0.867/0.004 | 0.401/0.001 | 0.703/0.005 | |
5 | 0.451/0.003 | 0.832/0.006 | 0.404/0.002 | 0.701/0.004 | 0.446/0.002 | 0.876/0.003 | 0.400/0.001 | 0.703/0.003 | |
SVM | 1 | 0.556/0.002 | 0.716/0.002 | 0.502/0.002 | 0.727/0.003 | 0.583/0.001 | 0.797/0.002 | 0.517/0.001 | 0.747/0.003 |
2 | 0.496/0.001 | 0.731/0.003 | 0.446/0.001 | 0.705/0.004 | 0.504/0.002 | 0.805/0.004 | 0.447/0.001 | 0.713/0.006 | |
3 | 0.480/0.003 | 0.770/0.009 | 0.430/0.002 | 0.702/0.005 | 0.483/0.002 | 0.834/0.001 | 0.429/0.001 | 0.706/0.004 | |
4 | 0.470/0.004 | 0.798/0.009 | 0.421/0.003 | 0.697/0.003 | 0.473/0.001 | 0.855/0.002 | 0.421/0.001 | 0.701/0.004 | |
5 | 0.465/0.003 | 0.818/0.006 | 0.417/0.003 | 0.698/0.006 | 0.468/0.003 | 0.863/0.002 | 0.417/0.001 | 0.700/0.006 | |
MLP | 1 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 |
2 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.934/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.001 | |
3 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | |
4 | 0.909/0.001 | 0.975/0.001 | 0.861/0.002 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | |
5 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.000 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | |
random guess | - | 0.209/0.002 | 0.250/0.002 | 0.250/0.004 | 0.250/0.002 | 0.210/0.001 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 |
train_size | 100 | 500 | |||||||
---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.994/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 |
LR | 1 | 0.207/0.002 | 0.292/0.009 | 0.259/0.000 | 0.618/0.012 | 0.207/0.002 | 0.288/0.008 | 0.260/0.001 | 0.619/0.003 |
2 | 0.208/0.005 | 0.283/0.005 | 0.261/0.002 | 0.609/0.002 | 0.205/0.001 | 0.293/0.004 | 0.259/0.001 | 0.609/0.010 | |
3 | 0.209/0.001 | 0.293/0.011 | 0.261/0.001 | 0.610/0.006 | 0.206/0.002 | 0.291/0.016 | 0.260/0.001 | 0.613/0.002 | |
4 | 0.205/0.004 | 0.294/0.009 | 0.259/0.001 | 0.611/0.012 | 0.206/0.004 | 0.293/0.003 | 0.260/0.002 | 0.609/0.009 | |
5 | 0.206/0.000 | 0.284/0.009 | 0.259/0.000 | 0.608/0.001 | 0.206/0.000 | 0.285/0.007 | 0.259/0.001 | 0.611/0.006 | |
SVM | 1 | 0.230/0.002 | 0.281/0.017 | 0.274/0.001 | 0.622/0.002 | 0.228/0.002 | 0.257/0.002 | 0.274/0.001 | 0.626/0.007 |
2 | 0.231/0.002 | 0.286/0.005 | 0.275/0.001 | 0.618/0.001 | 0.229/0.001 | 0.258/0.002 | 0.275/0.001 | 0.624/0.001 | |
3 | 0.226/0.002 | 0.272/0.003 | 0.273/0.001 | 0.617/0.006 | 0.229/0.003 | 0.263/0.001 | 0.275/0.000 | 0.620/0.012 | |
4 | 0.230/0.000 | 0.278/0.004 | 0.275/0.001 | 0.617/0.001 | 0.231/0.000 | 0.268/0.004 | 0.277/0.001 | 0.623/0.005 | |
5 | 0.231/0.001 | 0.280/0.001 | 0.276/0.002 | 0.620/0.012 | 0.230/0.001 | 0.265/0.004 | 0.277/0.001 | 0.619/0.003 | |
MLP | 1 | 0.230/0.001 | 0.272/0.006 | 0.274/0.001 | 0.621/0.000 | 0.231/0.004 | 0.268/0.010 | 0.277/0.002 | 0.621/0.003 |
2 | 0.233/0.002 | 0.286/0.002 | 0.275/0.001 | 0.628/0.012 | 0.231/0.004 | 0.269/0.001 | 0.276/0.002 | 0.629/0.003 | |
3 | 0.230/0.003 | 0.272/0.012 | 0.274/0.002 | 0.622/0.007 | 0.230/0.002 | 0.273/0.000 | 0.276/0.002 | 0.623/0.004 | |
4 | 0.227/0.005 | 0.274/0.004 | 0.272/0.002 | 0.617/0.002 | 0.229/0.002 | 0.284/0.016 | 0.274/0.001 | 0.621/0.001 | |
5 | 0.230/0.001 | 0.280/0.013 | 0.274/0.000 | 0.619/0.004 | 0.228/0.001 | 0.261/0.002 | 0.275/0.001 | 0.620/0.004 | |
random guess | - | 0.214/0.002 | 0.251/0.002 | 0.251/0.003 | 0.251/0.001 | 0.213/0.000 | 0.250/0.000 | 0.250/0.001 | 0.250/0.000 |
Impact of the number of cascades
5 Conclusion
Appendix A Proofs
A.1 Proof of Lemma 1
We design a neural network to simulate the diffusion process under the CLT model. For each diffusion step, we simulate phase 1 and phase 2 in turn, as detailed below.
Phase 1
Phase 2
A.2 Proof of Lemma 2
The linear programming approach shows that the ERM solution can be computed efficiently from the sample data. Given the sample set , the observed activation status in sample after time step during the diffusion process is given by . The loss of an influence function for the sample is given by
(22) |
In the prescribed neural network with one diffusion step , the outputs after phase 1 and after phase 2 can each be seen as a binary vector of size . We first focus on each output node .
Phase 1
Given the graph status after step , let denote the computation of the output node after phase 1, and , where denotes the parameters related to node . Then, the local prediction error can be defined as
(23) |
During the phase 1 diffusion process in each step, the propagation of each cascade can be seen as cascades diffusing independently according to the linear threshold model (and the neural network during phase 1 is a linear threshold network). Therefore, there always exist parameters such that after phase 1.
Phase 2
In phase 2, the comparison process involves no unknown parameters. Let denote the computation of the output node after phase 1. Then, the local prediction error can be defined as
(24) |
Formulating the LP
Our learning problem can thus be decoupled into a group of linear programs, one for each time step in each sample. The optimization is over the variables and a group of slack variables for and , which are introduced to help formulate the objective function. Therefore,
(25) |
Subject to:
where
A.3 Realizability of the hypothesis space
The realizability assumption on the hypothesis class is that there is a such that, given a sample set , . Given the influence function class , the samples are generated from the distribution of and a target function . In our setting, we care about the function . Therefore, given our designed neural network model with a finite VC dimension and following the ERM strategy, we can obtain a probably approximately correct model, which is also called the restricted model. Thus, there always exists a function consistent with the samples bartlett1999hardness. Therefore, supposing that are the underlying parameters of the CLT model, the following equation is always satisfied:
(26) |
This completes the proof.
A.4 Proof of Lemma 3
The following theorem gives the VC dimension of the class of neural networks with all piecewise-polynomial units.
Theorem 2.
? Suppose is a feed-forward network with a total of weights and computational units, in which the output unit is a linear threshold unit and every other computation unit has a piecewise-polynomial activation function with pieces and degree no more than . Suppose in addition that the computation units in the network are arranged in layers, so that each unit has connections only from units in earlier layers. Then if is the class of functions computed by ,
(27) |
for a fixed and ,
Note that the influence function class is the set of all possible neural networks we designed in Section 3. We first consider the neural networks with a single output . We use to denote the computation at the output node after time step , where . Based on the architecture of the neural network, for any , the attributes of the corresponding neural network can be summarized in Table 3.
parameters | symbol | diffusion step |
---|---|---|
layers | ||
maximum pieces of all units | ||
maximum degree of all pieces | ||
computational units | ||
adjustable parameters |
Therefore, we can obtain the upper bound of the VC dimension when with one output , denoted by . Following this idea, if we ignored the parameter sharing across time steps, the total number of unknown parameters in the global neural network would be , and according to Theorem 2 we would obtain a VC dimension bound of . However, all the parameters we want to learn are shared during the global diffusion process in , which leads to a lower ability to shatter a subset of points for that neural network. Therefore, we can obtain a tighter bound on the VC dimension of the influence function class when .
To see this, we first calculate , which is . Now we prove that when . Consider a set of points shattered by , where . From the definition of a shattered set, we have:
(28) |
where
From the above equation, we have
(29) |
We can see that the must be different so that all patterns of the nodes can be realized. This implies that the set is shattered by in time step . Thus, for any set of points of a given size shattered by when , there exists a set of points of the same size shattered by in time step . Therefore:
(30) |
Hence, . The result is summarized in Lemma 3.
A.5 Proof of Theorem 1
Theorem 3.
? Let be a hypothesis class of functions from a domain to and let the loss function be the loss. Then, the following are equivalent:
-
•
has the uniform convergence property.
-
•
Any ERM rule is a successful agnostic PAC learner for .
-
•
is agnostic PAC learnable.
-
•
is PAC learnable.
-
•
Any ERM rule is a successful PAC learner for
-
•
has a finite VC dimension
Theorem 4.
Let be a hypothesis class of functions from a domain to and let the loss function be the loss. Assume that . Then, there are absolute constants such that is PAC learnable with sample complexity:
(31) |
Therefore, based on the Fundamental Theorem of Statistical Learning, the influence function is PAC learnable with sample complexity for any such that with confidence , the generalization error , where
(32) |
By taking a union bound over all nodes, we can get .
Appendix B Additional materials for experiments
train_size | 50 | 100 | 500 | 1000 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
algorithm | iteration | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy | f_1 score | precision | recall | accuracy |
LP | - | 0.991/0.001 | 0.993/0.000 | 0.989/0.001 | 0.902/0.001 | 0.994/0.001 | 0.995/0.000 | 0.995/0.001 | 0.993/0.000 | 1.000/0.000 | 0.999/0.000 | 1.000/0.000 | 0.999/0.000 | ||||
LR | 1 | 0.488/0.002 | 0.622/0.002 | 0.448/0.002 | 0.695/0.002 | 0.499/0.001 | 0.676/0.002 | 0.452/0.001 | 0.709/0.002 | 0.509/0.001 | 0.732/0.001 | 0.456/0.001 | 0.719/0.004 | 0.508/0.001 | 0.732/0.002 | 0.455/0.001 | 0.718/0.005 |
2 | 0.471/0.002 | 0.668/0.011 | 0.427/0.002 | 0.697/0.004 | 0.474/0.001 | 0.752/0.012 | 0.425/0.002 | 0.705/0.004 | 0.473/0.002 | 0.811/0.004 | 0.421/0.002 | 0.708/0.003 | 0.472/0.001 | 0.816/0.004 | 0.421/0.001 | 0.708/0.005 | |
3 | 0.460/0.001 | 0.713/0.022 | 0.416/0.001 | 0.697/0.004 | 0.460/0.003 | 0.802/0.010 | 0.412/0.002 | 0.705/0.006 | 0.454/0.001 | 0.852/0.003 | 0.406/0.001 | 0.703/0.005 | 0.454/0.001 | 0.853/0.003 | 0.406/0.001 | 0.702/0.002 | |
4 | 0.458/0.004 | 0.743/0.019 | 0.413/0.004 | 0.698/0.006 | 0.452/0.003 | 0.827/0.005 | 0.405/0.002 | 0.701/0.004 | 0.449/0.001 | 0.867/0.004 | 0.401/0.001 | 0.703/0.005 | 0.448/0.001 | 0.874/0.003 | 0.401/0.001 | 0.700/0.006 | |
5 | 0.454/0.005 | 0.772/0.017 | 0.408/0.005 | 0.699/0.004 | 0.451/0.003 | 0.832/0.006 | 0.404/0.002 | 0.701/0.004 | 0.446/0.002 | 0.876/0.003 | 0.400/0.001 | 0.703/0.003 | 0.446/0.002 | 0.879/0.002 | 0.399/0.001 | 0.703/0.005 | |
SVM | 1 | 0.537/0.001 | 0.663/0.008 | 0.491/0.001 | 0.716/0.004 | 0.556/0.002 | 0.716/0.002 | 0.502/0.002 | 0.727/0.003 | 0.583/0.001 | 0.797/0.002 | 0.517/0.001 | 0.747/0.003 | 0.904/0.001 | 0.974/0.000 | 0.854/0.001 | 0.932/0.001 |
2 | 0.488/0.001 | 0.674/0.005 | 0.443/0.001 | 0.695/0.002 | 0.496/0.001 | 0.731/0.003 | 0.446/0.001 | 0.705/0.004 | 0.504/0.002 | 0.805/0.004 | 0.447/0.001 | 0.713/0.006 | 0.904/0.001 | 0.975/0.000 | 0.854/0.001 | 0.933/0.001 | |
3 | 0.476/0.003 | 0.705/0.011 | 0.431/0.002 | 0.693/0.002 | 0.480/0.003 | 0.770/0.009 | 0.430/0.002 | 0.702/0.005 | 0.483/0.002 | 0.834/0.001 | 0.429/0.001 | 0.706/0.004 | 0.904/0.000 | 0.974/0.000 | 0.854/0.001 | 0.931/0.000 | |
4 | 0.471/0.004 | 0.734/0.017 | 0.425/0.004 | 0.696/0.002 | 0.470/0.004 | 0.798/0.009 | 0.421/0.003 | 0.697/0.003 | 0.473/0.001 | 0.855/0.002 | 0.421/0.001 | 0.701/0.004 | 0.905/0.001 | 0.974/0.000 | 0.855/0.001 | 0.932/0.000 | |
5 | 0.465/0.004 | 0.757/0.018 | 0.420/0.004 | 0.692/0.005 | 0.465/0.003 | 0.818/0.006 | 0.417/0.003 | 0.698/0.006 | 0.468/0.003 | 0.863/0.002 | 0.417/0.001 | 0.700/0.006 | 0.904/0.001 | 0.974/0.001 | 0.854/0.002 | 0.932/0.001 | |
MLP | 1 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.935/0.000 | 0.910/0.001 | 0.975/0.000 | 0.862/0.001 | 0.935/0.000 |
2 | 0.909/0.001 | 0.975/0.001 | 0.862/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.934/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.001 | 0.908/0.001 | 0.975/0.000 | 0.859/0.001 | 0.935/0.000 | |
3 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.861/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | |
4 | 0.909/0.000 | 0.975/0.000 | 0.860/0.000 | 0.935/0.001 | 0.909/0.001 | 0.975/0.001 | 0.861/0.002 | 0.935/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.000 | 0.975/0.000 | 0.861/0.000 | 0.935/0.001 | |
5 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.934/0.001 | 0.909/0.001 | 0.975/0.000 | 0.860/0.001 | 0.934/0.000 | 0.909/0.000 | 0.975/0.000 | 0.860/0.001 | 0.935/0.001 | 0.909/0.000 | 0.975/0.000 | 0.861/0.001 | 0.935/0.000 | |
random guess | - | 0.208/0.002 | 0.250/0.002 | 0.250/0.003 | 0.250/0.002 | 0.209/0.002 | 0.250/0.002 | 0.250/0.004 | 0.250/0.002 | 0.210/0.001 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 | 0.208/0.000 | 0.250/0.001 | 0.250/0.001 | 0.250/0.001 |
train size | 50 | 100 | 500 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
algorithm | number of cascades | f1_score | precision | recall | accuracy | f1_score | precision | recall | accuracy | f1_score | precision | recall | accuracy |
LP | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
LR | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
SVM | 2 | ||||||||||||
4 | |||||||||||||
8 | |||||||||||||
NN | 2 | ||||||||||||
4 | |||||||||||||
8 |
parameters:
-
•
, for
-
•
, -active for ; for .
-
•
sample:
B.0.1 Phase 1
-
•
size:
-
•
(33)
(34) -
•
size:
-
•
:
(35)
(36)
B.0.2 Phase 2
Pairwise comparison
-
•
Add cascade index
-
•
(37) -
•
size:
-
•
:
(38) -
•
Begin comparison: the composition of these comparison layers can be seen as identical structures (each containing nodes) from the longitudinal view. Within each structure, the idea is to compare the values of every two adjacent nodes and pass the larger value forward until only one node is left, so the number of nodes is halved every two layers. Therefore, layers are needed to complete the pairwise comparison. Furthermore, during each pairwise comparison (every two layers among the layers), the architecture is the same.
-
•
(39) -
•
size:
-
•
: ReLu
(40) -
•
(41) -
•
size:
-
•
: Identity
(42) -
•
size:
-
•
: Relu
-
•
size:
-
•
: Identity
-
•
…
-
•
size:
-
•
: Identity
Status recovering
-
•
(43) -
•
size:
-
•
:
(44) -
•
(45) -
•
size:
-
•
:
(46)
layer
takes as the input. For further illustration, we reshape the matrix into a vector . Let denote the input layer. The matrix indicates the weights associated with each edge in the graph. We use to index the nodes in the input layer and the layer, where and . Furthermore, to keep the activated nodes active during the following diffusion process, an additional link with a fixed weight is added for each node between these two layers. The weight matrix is defined as follows.
(47) |
The activation function is designed to execute the linear threshold function with the unknown thresholds. In order to simulate the phase diffusion process, it is crucial to record the weight summation of each cascade, which is exactly . Therefore, the activation function is defined as follows.
(48) |
Figure 4 gives an example of the structure of this group with cascades. Clearly, the values of the nodes in the layer indicate the graph status after phase .
Lemma 4.
For , if , then node in graph is -active after phase ; otherwise inactive.
layer
This layer is designed to preserve the pre-activated cascade index . An additional binary node is added for each node in the layer. The matrix is fixed and defined as follows.
(49) |
The activation function is the identity function and defined as follows.
(50) |
B.0.3 Comparison layers
This group simulates the phase 2 diffusion process. Since the comparison among the weight summations of the cascades happens inside every node, the composition of these comparison layers can be seen as identical structures from the longitudinal view. An example of one user with cascades is given in Figure 5. For every nodes, we construct a comparison structure. The idea is to compare the values of every two adjacent nodes and pass the larger value forward until only one node is left, so the number of nodes is halved every layers. Therefore, layers are needed to complete the comparison process.
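A compact way to view these comparison layers is the following NumPy sketch: starting from the encoded values of one node, each round halves the number of candidates using the ReLU-based maximum, so a power-of-two number of cascades is reduced to a single winner in logarithmically many rounds. It illustrates the layer structure only, not the exact weight matrices.

```python
import numpy as np

def tournament_max(values):
    # values: encoded summations of one node, length assumed a power of two
    vals = np.asarray(values, dtype=float)
    while vals.size > 1:
        a, b = vals[0::2], vals[1::2]          # adjacent pairs
        vals = np.maximum(a - b, 0.0) + b      # ReLU(a - b) + b == max(a, b)
    return vals[0]
```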
In the whole set of comparison layers for every nodes, the weight matrix is identical and fixed. Since the number of nodes keeps being halved every layers, for simplicity we set the input size to , for . If , is defined in Equation 51; otherwise, is defined in Equation 52.
(51) |
(52) |
ReLU and identity functions are used in turn within this group. When , we use the ReLU function; otherwise, we use the identity function. For , the activation function is defined as follows.
(53) |
Returning to the whole structure of these layers, by constructing a comparison structure for every nodes, the output is a vector of size . Each node in this layer can be seen as a node in the network, and its value indicates the activation status after phase 2 along with the corresponding weight summation.
B.0.4 Recover layers
This group contains layers and is designed to recover the format of the output. Figure 6 shows an example of the structure of this group.
layer
This layer is designed to extract the cascade index from the single digit of . We use to index each node. The weight matrix is defined as follows.
(54) |
We use a step function (a piecewise-constant function, i.e., a polynomial of degree zero in each piece) with pieces to return the final status of each node in the graph; it is defined as follows.
(55) |
layer
This layer is used to convert the format of the output to be the same as that of the layer, and the output is . The weight matrix is defined as follows.
(56) |
For each node in the output layer, we introduce a fixed threshold . The activation function is defined as follows.
(57) |
Therefore, the output of is the graph status after phase 2.
Lemma 5.