¹¹institutetext: Indraprastha Institute of Information Technology, Delhi ²²institutetext: Indian Institute of Technology, Delhi ³³institutetext: Indian Institute of Technology, Hyderabad ⁴⁴institutetext: Queen’s University, Belfast
⁴⁴email: [email protected], ⁴⁴email: [email protected], ⁴⁴email: [email protected], ⁴⁴email: [email protected], and ⁴⁴email: [email protected]

Modeling Implicit Communities using Spatio-Temporal Point Processes
from Geo-tagged Event Traces

Ankita Likhyani 11 Vinayak Gupta 22 Srijith PK 33 Deepak P 44 Srikanta Bedathur 22

Abstract

The location check-ins of users through various location-based services such as Foursquare, Twitter and Facebook Places, etc., generate large traces of geo-tagged events. These event-traces often manifest in hidden (possibly overlapping) communities of users with similar interests. Inferring these implicit communities is crucial for forming user profiles for improvements in recommendation and prediction tasks. Given only time-stamped geo-tagged traces of users, can we find out these implicit communities, and characteristics of the underlying influence network? Can we use this network to improve the next location prediction task? In this paper, we focus on the problem of community detection as well as capturing the underlying diffusion process and propose a model CoLAB based on spatio-temporal point processes in continuous time but discrete space of locations that simultaneously models the implicit communities of users based on their check-in activities, without making use of their social network connections. CoLAB captures the semantic features of the location, user-to-user influence along with spatial and temporal preferences of users. To learn the latent community of users and model parameters, we propose an algorithm based on stochastic variational inference. To the best of our knowledge, this is the first attempt at jointly modeling the diffusion process with activity-driven implicit communities. We demonstrate CoLAB achieves upto 27% improvements in location prediction task over recent deep point-process based methods on geo-tagged event traces collected from Foursquare check-ins.

1 Introduction

Proliferation of smartphone usage and pervasive data connectivity have made it possible to collect enormous amounts of mobility information of users with relative ease. Foursquare announced in early 2018 that it collects more than $3$ billion events every month from its $25$ million users¹¹1https://bit.ly/2BdhnnP (accessed in February 2019). These events generate a location information diffusion process through an underlying –possibly hidden– network of users that determines an location adoption behavior among users. Location adoption primarily depends upon user’s spatial, temporal and categorical preferences. For instance, one user’s check-in at a newly opened jazz club could inspire another user to visit the same club or a similar club in her vicinity depending upon the distance from the club and time of the day/week. These users might not be having a social connection but it’s an implicit influence because of similar choices. This often leads to the formation of –possibly overlapping– communities of users with similar behavior. Detecting community of such like minded people from large geo tagged events can benefit applications of various domains such as targted advertisements and friend recommendation. Prior work [39] also suggests that social connections are not as effective factors for prediction tasks.

In this paper, we move away from communities derived purely from social network links, and instead explicitly identify spatio-temporal activity-driven communties. Eschewing the reliance on social connections alone allows us to learn communities that are not overly biased towards connected users. At the same time, it supports the use of CoLAB under settings where social network is not available – either due to privacy settings of users, or due to other restrictions from platform.

Example 1

We illustrate the spatio-temporal activity-based network and communities that CoLAB derives using the US dataset we collected, shown in Figure 1. We observe that explicit social connections, shown using black edges, are very sparse and are unable to capture the implicit influence network. On the other hand, the latent network derived by CoLAB, shown using gray edges, can identify significantly higher number of relations in the latent influence network. It can be observed that it is not only captures the clusters but also identify potential influencers (and their influence networks) which can critically help in prediction tasks.

In this paper we determine that the location adoption process and community formation among users can be explained by the same latent factors underlying the observed behavior of users without considering their social network information. We propose CoLAB (Communities of Location Adoption Behaviour) that focuses on jointly inferring location specific influence scores between users and the communities they belong to, based solely on their activity traces. CoLAB completely disregards the social network information, which makes it suitable for scenarios where only activity traces are available. Note that if social network is available, it can be used as a prior or a regulization over the influence matrix we derive. Unlike the current best-performing models for location prediction task (e.g., [36]), CoLAB avoids the community formation from being biased only by the availability of social connections. Thus we generate better communities even for users with few or no social connections as shown in figure 1.

Further, we generate overlapping communities of users that take into account the spatio-temporal patterns, and the shared special interests over location categories. Although a user may be part of multiple communities with different special interests, we assume each event or check-in to be associated with only one of those communities. Prior works have focused on modeling temporal + textual together [39] or temporal + spatial features together [35], none of the techniques to the best of our knowledge have modeled the entire combination of temporal, spatial and location semantics in a spatio-temporal point process to infer the underlying influence network.

Contributions. Our main contributions are as follows:

1.

We propose a novel model called CoLAB to model activity patterns over geo-tagged event traces. It leverages spatio-temporal Hawkes process[5, 28] to not only construct a information diffusion-based latent network but also recover overlapping community structures within it.
2.

We develop a novel stochastic variational inference technique to learn the latent communities and model parameters.
3.

As our target is to identify communities that comprise users who share interests as opposed to just being socially connected, there is unfortunately no gold-standard community information to evaluate our results. Therefore, we first empirically evaluate our method over synthetic data; further, results shows our inference algorithm can accurately recover the model parameters. For communties evaluation on real data, we make use of a joint loss function that evaluates on the basis intra-community properties defined in terms of users’ category affinity and spatial dispersment in their checkin characteristics.
4.

We evaluate on two real-world geo-tagged event traces collected from two countries – viz., SA (Saudi Arabia) and US. The experimental results demonstrates that we achieve upto 27% improvement over neural network based models.

2 Related Work

In this section, we briefly review existing literature on community detection and characterization of location adoption.

2.1 Community Detection

Within general social networks, there has been much work on detecting communities as overlapping or non-overlapping clusters of users such that there is a high degree of connectedness between them. Techniques have largely considered social-network connectedness as the main driver in forming communities. Community detection techniques could adopt a top-down approach starting from the entire graph and form communities by separating them into more coherent subsets that would eventually become communities. Methods from this school include those that use graph-based methods [27] and filtering out edges based on local informations such as number of mutual friends [40]. Analogously, community discovery could proceed bottom-up by aggregating proximal nodes to form communities. Techniques within this category have explored methods such as merging proximal cliques [16] or by grouping nodes based on affinities to ’leader’ nodes [15]. Point processes[6], which are popular models for sequential and temporal data, have been recently explored for community detection in general social networks[29]. NetCodec, the method proposed therein, targets to simultaneously detect community structure and network infectivity among individuals from a trace of their activities. The key idea is to leverage multi-dimensional Hawkes process in order to model the relationship between network infectivity vis-a-vis community memberships and user popularity, in order to address the community discovery and network infectivity estimation tasks simultaneously. Our task of community detection within LBSNs is understandably more complex due to the primacy of spatial information in determining LBSN community structures. Accordingly, we leverage a spatio-temporal variant of multi-dimensional Hawkes process, in devising our solution.

LBSN Community Detection: There has been existing work [25, 31, 23, 19] on community detection in LBSNs that leverage features such as venue categories, temporal features, geo-span, social-status, structure based and location based features in determining community structure within clustering formulations. They have used standard ML based techniques such as Spectral clustering [25], $M^{2}$ [31]: a k-means based clustering, Entropy-based [23], and Sequence-Mining [19] based techniques. Our method, as illustrated therein, models a wider variety of features, incorporating both spatio-temporal check-in information as well as venue category information, in inferring communities and user-user influence.

2.2 Location Adoption Characterization

With each LBSN check-in being associated with a location, the location information is central to LBSNs. There has been much research into modelling user-location correlations in various forms, which may be referred to as location adoption in general.

[14] is the closest work to our work, where authors determine the patterns from geo located posts from twitter. Our model deals with discrete locations unlike [14] which simply considers sampling of spatial locations from a continuous distribution which can not be used for modeling check-in locations. The intensity function in CoLAB seamlessly integrates temporal and spatial components of the model and jointly learns the parameters whereas in [14] location and time are modeled separately. Moreover, we use multi-dimensional Hawkes Process as intensities differs across users. The parameter estimation of multi-dimensional Hawkes Process is much harder and Monte Carlo Simulations based methods used in [14] are very slow. We introduce the use of stochastic variational inference (SVI) [2] to overcome this. [11] and [10] exploit the role of social correlations and temporal cyclical behaviors in improving upon the task of predicting the next location that a user would check-in. Location categories (e.g., restaurant, pub etc.) has also been seen to be useful for next location prediction[22]. [34] exploit geographical influence across locations in recommending POIs (points of interest) to LBSN users.

Another task that has attracted significant attention within LBSNs is that of quantifying user influence in LBSNs. [37] consider using geo-social correlation between users to estimate mutual influence. [32] leverage the observation that mutual influence is central to successful recommendation to identify a set of influential seed nodes using information cascade models. [3] adopt influence propagation models to address a slight variant, that of identifying region-specific influential users. [18] and [30] propose other kinds of models for the influence maximization task within LBSNs. ocation Promotion [42][21] and Trip Purpose Detection [13] form other les addressed tasks within the context of LBSNs. Mobility models that mine spatial patterns based on generative models [11], Gaussian distributions [4] and kernel density based estimations [20] have been particularly popular in modeling location adoption behavior within LBSNs.

3 Problem Statement

Consider an geo-tagged event trace dataset $S$ over a set of $L$ locations $\mathbb{L}=\{\ell_{k}\}_{k=1}^{L}=\{(x_{k},y_{k})\}_{k=1}^{L}$ , a set of $I$ users $\mathbb{U}=\{i_{k}\}_{k=1}^{I}$ and $V$ categories (restaurants, entertainments etc.). Let us consider there are $N$ events, with the $n^{th}$ check-in denoted as $E_{n}=(t_{n},\ell_{n},c_{n},i_{n},g_{n})$ and $\mathbb{E}=\{E_{n}\}_{n=1}^{N}$ . The notation denotes that $E_{n}$ is the check-in event involving the user $i_{n}$ checking in to location $\ell_{n}$ at time $t_{n}$ , with the category associated with the location being $c_{n}$ and the latent community of the user associated with the check-in is $g_{n}$ . The task is to learn the latent community associated with the users and effectively model the diffusion of information among users. Towards this, we aim to learn a matrix $\phi$ of size $|\mathbb{U}|$ x $|M|$ where $i^{th}$ row represents community participation for the $i^{th}$ user (assuming $M$ communities). In addition, to model the diffusion process, we estimate matrix $A_{ij}$ , where an element $a_{ij}$ represents the influence of $i^{th}$ user on $j^{th}$ user.

4 Preliminaries

Symbol	Description
$\mathbb{U}$	set of $I$ users
$\mathbb{L}$	set of $L$ locations
$\mathbb{G}$	set of $M$ communities (latent)
$\lambda_{i}(t,x,y)$	intensity at time $t$ and spatial coordinate $x,y$ of user $i$
$\mu_{i}$	base rate of checkins of user $i$
$A_{ij}$	influence of user $i$ on user $j$
$\kappa(t,x,y)$	triggering / self exciting kernel

Table 1: Terminology

Hawkes Process [12]. It is a point process with self triggering property, has been explored to model temporal occurrences of events in social media [zhao15]. The conditional intensity is modelled as: $\lambda(t)=\mu+\sum_{t_{k}<t}\kappa(t-t_{k}),$ where $\mu$ is the base intensity of event occurrence, with $\kappa()$ (kernel function) modelling the influence of past events using an exponential decay function.

Multidimensional Hawkes Process. This extension allows modelling of influence of events across entities (e.g., users in the social media case) by considering time-stamped events from across entities [41]; this naturally yields to modelling infectivity across users. The intensity function for an entity $i$ at time $t$ depends on past events as: $\lambda_{i}(t)=\mu_{i}+\sum_{t_{k}<t}A_{i_{k}i}\kappa(t-t_{k}),$ where $(t_{k},i_{k})$ represents historical user/entity events on time scale, $\mu_{i}$ $>$ 0 being the base intensity for the $i^{th}$ entity/user, the non-negative influence matrix $A_{ij}$ quantifying the quantum of influence from user/entity $i$ to $j$ .

Spatio-Temporal Hawkes Process:. This extends the basic Hawkes process along the spatial dimension[5]. These simultaneously model spatio-temporal clustering behavior and has been found useful for modelling dynamics of epidemics and crime. The intensity function takes the form:

\lambda(\ell,t|H_{t})=\mu+\sum_{t_{k}<t}\kappa(\ell-\ell_{k},t-t_{k}),

where $\ell_{k}$ is the location associated with the event at $t_{k}$ .

5 CoLAB Model

In this section we describe our model to infer the $\phi$ matrix i.e. the communities vector inferred from check-in activities and information diffusion over the users in the network.

5.1 Spatio-Temporal Data Modeling

Given the check-in events $E$ as defined earlier, for modeling time and spatial components w.r.t. communities, we define the Hawkes process based model as follows:

5.1.1 Intensity Function

We model the user’s community-specific intensity using the multi-dimensional spatio-temporal Hawkes process. Multi-dimensional because influence from other users also contribute in the intensity of a user [41]. Consider the task of estimating the community-specific intensity of a user $i_{n}$ towards generating a check-in $E_{n}=(t_{n},\ell_{n},c_{n},i_{n},g_{n})$ ; the multi-dimensional spatio-temporal Hawkes process formulation yields the following:

\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})=\mu_{i_{n}}\eta_{g_{n}}+\sum_{t_{k}<t_{n}}A_{i_{k}i_{n}}\kappa(t_{n}-t_{k},\ell_{n}-\ell_{k})\mathbb{I}(g_{k}=g_{n})

(1)

where $\mu_{i_{n}}$ is the base intensity of user $i_{n}$ and $\eta_{g_{n}}$ is weight associated to $g_{n}$ towards a community $g_{n}$ , and $\boldsymbol{\eta}=\{\eta_{g}|g=1,...,M\}$ with $\eta_{g}\geq 0$ . $\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})$ is community specific intensity of user $i$ at the $n^{th}$ instance. We allow historical check-ins to contribute to the intensity - proportionate to their temporal and spatial proximity to $t_{n}$ and $\ell_{n}$ respectively, and weighted using the influence between user $i_{n}$ and $i_{k}$ (i.e., $A_{i_{k}i_{n}}$ ) - as long as they belong to the same community, enforced by the indicator function $\mathbb{I}(g_{k}=g_{n})$ . Here, $\kappa(t_{n}-t_{k},\ell_{n}-\ell_{k})$ is the triggering exponential kernel which factorises over time and location.

\kappa(t_{n}-t_{k},\ell_{n}-\ell_{k})=\kappa(t_{n}-t_{k})*\kappa(\ell_{n}-\ell_{k}),

(2)

where, $\kappa(t_{n}-t_{k})=\exp(-\nu(t_{n}-t_{k}))$ is the time specific triggering kernel with $\nu$ decay and $\kappa(\ell_{n}-\ell_{k})=\frac{1}{2\pi\,h}exp\left(-\frac{||\ell_{n}-\ell_{k}||}{2h}\right)$ is the location specific triggering kernel with $h$ bandwidth. When the decay parameter is low, the influence of the previous events is high and similarly when the bandwidth parameter is high the influence of previous locations is high.

In general, the intensity of a particular user $i$ at some time $t$ and location $\ell$ , is given as the sum of intensities that are estimated at the level of each community i.e. total intensity $\lambda_{i}(t,\ell)$ = $\sum_{g}\lambda_{i,g}(t,\ell)$

5.2 Category Distribution

The category $c$ associated with a check-in is represented as a $|V|$ -length vector and it represents one of $V$ possible categories²²2We overload the notation $c$ to also represent a scalar categorical value in the set $\{1,\ldots,V\}$ such as restaurant, entertainment etc. associated with the check-in. Also, we assume that the category depends on the underlying latent community associated with this check-in. For example, some community may be more inclined towards restaurants while another community is oriented towards sports. The category is modelled as a sample from a Multinomial (categorical) distribution,

c\sim Multinomial(\boldsymbol{\theta_{g}})

(3)

where $\boldsymbol{\theta_{g}}$ is a $|V|$ -length vector whose elements encode probability of each category and which depends on the community $g$ that the check-in belongs to. We assume a prior over $\boldsymbol{\theta_{g}}$ as a sample from Dirichlet distribution with parameters $\boldsymbol{\theta_{0}}$ . We write the conditional distribution $p(c,\boldsymbol{\theta_{g}}|\boldsymbol{\theta_{0}})$ as:

p(c,\boldsymbol{\theta_{g}}|\boldsymbol{\theta_{0}})=p(c|\boldsymbol{\theta_{g}})p(\boldsymbol{\theta_{g}}|\boldsymbol{\theta_{0}})=\theta_{g,c}\frac{\Gamma(\sum_{j}\theta_{0,j})}{\prod_{j}\Gamma(\theta_{0,j})}\prod_{j}\theta_{g,j}^{\theta_{0,j}-1}

(4)

$j$ runs over the $V$ categories and $p(c|\theta_{g})$ is given as $\theta_{g,c}$

5.3 Distribution over communities

We assume the latent variable $g$ (the communities) associated with the user for some check-in, is distributed as multinomial distribution parameterized by $\pi_{i}$ for a user $i$ . $\pi_{ig}$ represents the probability user $i$ belongs to community $g$ .

g\sim Multinomial(\boldsymbol{\pi_{i}})

(5)

5.4 Generative Process

1Initialize the number of communities

M

, and number of checkins

N_{i}

for each user;

2 Set

\mu_{i}

proptional to

N_{i}

;

3 Initialize

A_{ij}

as column normalized matrix;

4 Initialize

\pi

\eta

and

\theta

as Dirichlet-Multinomial distribution ;

5 Initialize

\lambda_{i}(t_{0},x_{0},y_{0})=\mu_{i}\quad\forall i=1,\ldots,U

;

6 for n = 1 to N do

7 Sample (

t_{n},\ell_{n}

) from

\sum_{i=1}^{U}\lambda_{i}(t,\ell)

;

8 Sample

i_{n}

from Multinomial (

\lambda_{1}(t_{n},\ell_{n})

\lambda_{2}(t_{n},\ell_{n})

, … ,

\lambda_{U}(t_{n},\ell_{n})

);

9 Sample

g_{n}

from a Multinomial (

\pi_{i_{n}}

);

10 Sample

c_{n}

from

Multinomial(\theta_{g_{n}})

(

\theta_{g}

is defined in section 5.2);

12 end for

Algorithm 1 Generative Process

Note that, for sampling (t,x,y), the thinning algorithm proposed in [17] is modified in order to sample location coordinates from discrete “venue” set rather the continuous space. First we consider a discrete set of locations $L$ for the user based on her region. We sample $(x^{\prime},y^{\prime})$ at $n^{th}$ iteration, from a Gaussian distribution centered at the previous coordinates in the $(n-1)^{th}$ iteration: $(x_{n-1},y_{n-1})$ . Once $(x^{\prime},y^{\prime})$ is sampled, the nearest coordinate in the $L$ is determined and returned as $(x_{n},y_{n})$ .

6 Estimation and Inference

Given the multi-dimensional Hawkes process model defined above, the joint probability density function over the check-in events $\mathbb{E}$ is given as:

\prod_{n=1}^{N}p(t_{n},l_{n},c_{n},g_{n}|i_{n})=\prod_{n=1}^{N}\biggl{(}(p(t_{n},l_{n}|i_{n},g_{n})\times p(g_{n}|i_{n}))\times p(c_{n}|g_{n},\theta)\biggr{)}

(6)

Here, $p(c_{n}|g_{n},\theta)=\theta_{g_{n},c_{n}}$ , where $c_{n}$ represents the category associated with the $n^{th}$ check-in, and $p(g_{n}|i_{n})=\pi_{i_{n},g_{n}}$ the probability that user $i_{n}$ belong to the community $g_{n}$ .

\prod_{n=1}^{N}p(t_{n},\ell_{n}|i_{n},g_{n})=\prod_{n=1}^{N}\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})\exp\biggl{(}-\sum_{i=1}^{U}\int\limits_{0}^{T}\int\limits_{\ell_{min}}^{\ell_{max}}\lambda_{i}(t,\ell)dtd\ell\biggr{)}

(7)

is the likelihood (event density) of generating the observations given the community and users in the interval $[(0,\ell_{min}),(T,\ell_{max})]$ . The first term in (7) provides the instantaneous probability of occurrence of the observed events and the second term provides the probability that no event happens outside these observations (survival probability) [6]. Thus, the complete joint log likelihood is:

\mathcal{LL}=\sum_{n=1}^{N}\bigg{(}\log\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})+\log\pi_{i_{n},g_{n}}+\log\theta_{g_{n},c_{n}}\bigg{)}\\ -\sum_{i=1}^{U}\int\limits_{0}^{T}\int\limits_{\ell_{min}}^{\ell_{max}}\lambda_{i}(t,\ell)dtd\ell

(8)

Assuming communities are known, we can estimate the model parameters $\mu$ , $\eta$ , $A$ , $\theta_{g}$ ’s and $\pi$ ’s by maximum likelihood estimation. We treat the kernel parameters, and the Dirichlet parameters as the hyper-parameters which we initialize to some fixed values. However, the communities are latent and the maximum likelihood estimation cannot be applied directly. This calls for the expectation maximization algorithm, where the parameters are estimated after integrating out the latent variables from the joint likelihood using the posterior distribution over the latent variables. In our case, the posterior distribution over latent communities is given as

p(g_{1},\ldots,g_{n}|\{t_{n},\ell_{n},c_{n},i_{n}\}_{n=1}^{N})\\ =\frac{\prod_{n=1}^{N}p(t_{n},\ell_{n}|i_{n},g_{n})\times p(c_{n}|g_{n},\theta)\times p(g_{n}|i_{n})}{\sum_{g_{1},\ldots,g_{n}}\prod_{n=1}^{N}p(t_{n},\ell_{n}|i_{n},g_{n})\times p(c_{n}|g_{n},\theta)\times p(g_{n}|i_{n})}

(9)

The posterior distribution over the latent communities cannot be obtained in closed form due to the intractable normalization constant (denominator term) which involves an exponential number of summation terms. Markov chain Monte Carlo methods [1] can be used to obtain samples from the posterior. However, these approaches are not scalable to large datasets [2] and becomes computationally expensive for use in LBSNs. To overcome this, we use a variational expectation maximization algorithm where we approximate the posterior over communities using a variational distribution and estimate the model parameters and variational parameters by maximizing a variational lower bound [38].

6.1 Variational Expectation Maximization

The latent variables $g_{n}$ ’s dependent on different types of feature set i.e. space, time through $p(t_{n},\ell_{n}|i_{n},g_{n})$ and semantics through $p(c_{n}|g_{n},\theta)$ . Though the prior over $g_{n}$ is conjugate to $p(c_{n}|g_{n},\theta)$ , it is not with respect to $p(t_{n},\ell_{n}|i_{n},g_{n})$ and hence the posterior over $g_{n}$ cannot be computed in closed form. Moreover, $g_{n}$ ’s are inter-dependent i.e. at current step $g_{n}$ it is dependent on history from $g_{1}$ to $g_{n-1}$ as well as the future ones i.e. $g_{n+1}$ . Thus marginalizing out over such interconnected latent variables to compute the normalization constant for the posterior is intractable. To this end we assume a variational distribution over $g_{n}$ ’s conditioned on the user $i_{n}$ . The conditional variational distribution over $g_{n}$ is considered to be a multinomial distribution with parameters $\boldsymbol{\phi_{i_{n}}}$ . The variational parameter $\boldsymbol{\phi_{i}}$ for a user $i$ represents the posterior probability distributions over the communities for the user as observed from the data.

q({g_{n}}|i_{n})=Multinomial(g_{n}|\boldsymbol{\phi_{i_{n}}})

(10)

The variational parameters can be learnt by minimizing the KL divergence between the variational posterior (10) and the exact posterior (9). However, a direct minimization of KL divergence is not possible due to the intractable posterior. Following variational inference approach [2], the variational parameters are learnt by maximizing a variational lower bound, Evidence Lower Bound (ELBO), which indirectly minimizes the KL divergence. ELBO is obtained by considering an expected value of the complete joint log likelihood w.r.t the variational distribution [2] and acts as a lower bound to the marginal likelihood or evidence (normalization constant of the posterior). Hence, ELBO is useful to learn the model parameters also in addition to the variational parameters. Using the variational distribution defined in (10) and the complete joint log likelihood (8), we obtain the ELBO as:

\mathscr{L}=\sum_{n=1}^{N}\bigg{(}\mathbb{E}_{q}[\log\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})]+\sum_{m=1}^{M}\phi_{i_{n},m}\log\pi_{i_{n},m}+\sum_{m=1}^{M}\phi_{i_{n},m}\log\theta_{m,c_{n}}\bigg{)}\\ -\sum_{i=1}^{U}\int\limits_{0}^{T}\int\limits_{\ell_{min}}^{\ell_{max}}\mathbb{E}_{q}[\lambda_{i}(t,\ell)]d\ell dt-\mathbb{E}_{q}[\log q]

(11)

Here, $\mathbb{E}_{q}$ represents the expectation with respect to the variational distribution $q$ defined in (10). We learn the variational parameters and the model parameters by maximizing the ELBO. Table 2 lists the model parameters and variational parameters to be learnt using ELBO. All the terms in the ELBO except the first term can be computed in closed form.

Table 2: Parameters to be estimated and whether a Hyperparameter

Par	Description	H
$\boldsymbol{\mu}$	Base Intensity
$\boldsymbol{\eta}$	Weight associated towards community
$A_{ij}$	Influence Matrix
$h$	Bandwidth (KDE)	$\checkmark$
$\nu$	Temporal Decay Parameter	$\checkmark$
$\boldsymbol{\theta_{0}}$	Dirichlet Prior: Category	$\checkmark$
$\boldsymbol{\theta_{g}}$	Multinomial Prior: Categories / community
$\pi$	Multinomial Prior: Communities (All users)
$\phi$	Variational Parameters: Communities (All users)

Since the first term in (11) cannot be computed in closed form, we approximate it using the samples from the variational posterior (10) (Monte-Carlo approximation).

\mathbb{E}_{q}[\log\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})]\approx\frac{1}{S}\sum_{s=1}^{S}\log\lambda_{i_{n},\boldsymbol{g^{(s)}}}(t_{n},\ell_{n})

(12)

where $\boldsymbol{g^{(s)}}$ represent the vector of $N$ samples sampled from the joint variational distribution over all the $g_{n}$ ’s, i.e. $q(\boldsymbol{g^{(s)}})=\prod_{i=1}^{N}q(g_{n}^{(s)}|\phi_{i_{n}})$ . This results in a stochastic variational lower bound where the stochasticity arises due to the approximation of expectation using Monte Carlo sampling [38]. We learn the model parameters $\mu$ , $\eta$ , $A_{ij}$ , $\theta$ by maximizing the stochastic variational lower bound [26] using gradient based methods. However learning the variational parameters is problematic as the variational parameters does not appear explicitly in the stochastic term but only through the samples. For determining gradient w.r.t. $\phi$ we apply the Reinforcement trick to the stochastic term and compute the gradient as follows [9]:

\bigtriangledown_{\phi}\mathbb{E}_{q}[\log\lambda_{i_{n},g_{n}}(t_{n},\ell_{n})]\\ \approx\frac{1}{S}\sum_{s=1}^{S}\log\lambda_{i_{n},\boldsymbol{g^{(s)}}}(t_{n},\ell_{n})\bigtriangledown_{\phi}\log q(\boldsymbol{g^{(s)}})

(13)

7 Experiments

Compared Baselines We empirically evaluate the performance of CoLAB ³³3We pledge to make our codes and datasets public. over synthetic and real datasets with the following baselines:

•

STHP: Spatio-Temporal Hawkes Process models the diffusion process across spatial and temporal dimensions but ignores the category associated location feature. This is the baseline derived from CoLAB ignoring the categories.
•

Sequence Mining [19]: First extracts frequent occuring venue category sequences and assigns communties based on clusters with similar patterns.
•

DH [8]: Dirichlet-Hawkes, clusters continuous time event streams using a modified Hawkes model with preferential cluster assignment through Dirichlet Process.
•

RMTPP:[7] Recurrent Marked Temporal Point Process model the time and the marker information by learning a general representation of the nonlinear dependency over the history based on recurrent neural networks. In this model, event history is embedded into a compact vector representation which is then used for predicting the next event time and marker type.

For an even comparison, we feed the check-in events to all the baselines and evaluate the community quality formed by DH and Sequence Mining along with location prediction for STHP and RMTPP.

7.1 Synthetic Data

We generate synthetic data using algorithm 1 (statistics in Table 4). The set of locations, #users to categories i.e. (U:V) and to the number of checkins i.e. (U:N) are kept similar to the real data collected from Brazil with $\nu$ (temporal decay parameter) is set to 0.01 [33] and $h$ is picked up from the bandwidth values learned from users’ checkins. During inference, we use true values for all the parameters, except for the influence matrix $A_{ij}$ and the user-community posterior $\phi$ which are estimated. We use $RelErr(A_{ij},\hat{A}_{ij})=\frac{1}{I^{2}}\sum_{i,j=1}^{I}\frac{|a_{ij}-\hat{a}_{ij}|}{|a_{ij}|}$ (and similarly for $\phi$ ), as the metric to evaluate the ability to recover the true values. Table 4 indicates that the stochastic variational inference technique offers considerably better reconstruction of the parameters, recording significant reductions in the error.

Table 3: Dataset Properties

Property	#Users(U)	#Communities(M)	#Categories(V)	#Check-ins(N)
Synthetic	100	10	200	9777
SA (Real)	95	-	314	15110
US (Real)	133	-	524	22059

Table 4:

RelErr

A

and

\phi

and True Positives for Location Prediction results at Top-K on Synthetic Data

Technique	$A$	$\phi$	5	10
	RelErr		Top K
STHP	0.99174	0.13807	681	1206
CoLAB	0.04813	0.07216	972	1677

7.2 Real Data

For real data, we use our crawls over Foursquare conducted between January-2015 and March-2016 and construct two collections consisting of check-ins from Saudi Arabia (SA) and United States (US), with details given in Table 4. We allocated first 80% (as per check-in timestamp) of each dataset to training and remaining for testing. Here, we also use temporal decay parameter as 0.01 [33] and $h$ is learnt for each user based on Silverman’s rule of thumb in kernel density estimation [1] and is fixed during the joint estimation.

7.2.1 Location Prediction with CoLAB

For location prediction task, we predict the next location from the previously seen locations in the training set at $M(\#communities)=10$ and at various top- $K$ ranked (eq. 8) cutoffs. Since, DH is for clustering event streams, therefore we make use of STHP and RMTPP as baselines for the Location Prediction task. Table 5 shows that CoLAB is able to offer significant improvements (18-37% in the top- $5$ ). As we increase $K$ , these gains diminish because the number of candidate locations saturates. We also study the effect of $A_{ij}$ (influence matrix) and $\mu$ (base intenisty) over CoLAB’s performance. It can be observed that without $A_{ij}$ , the CoLAB’s performance degrades signifying CoLAB’s ability to capture the underlying diffusion process well.

Table 5: Comparison of CoLAB with other baselines and CoLAB without

A_{ij}

and

\mu

Dataset	K	STHP	RMTPP	CoLAB w/o Aij	CoLAB w/o $\mu$	CoLAB
SA (#testcases = 2805)	5	287	250	279	331	339
	10	455	499	478	589	593
	20	950	951	911	1038	1043
	50	1539	1539	1520	1664	1666
US (#testcases = 4395)	5	153	172	172	231	237
	10	456	467	445	507	512
	20	870	888	919	920	927
	50	1700	1691	1759	1668	1673

7.2.2 Impact of #Communities

In figures 2 and 3 we study the impact of $M$ , over SA and US data we observe that with increasing $M$ the prediction accuracy improves and then diminishes, signifying optimal value of $M$ for better predictions. CoLAB performs significantly better than STHP, primarily due to the better estimation of $A_{ij}$ (influence matrix), because of the presence of category information in the CoLAB model.

7.3 Community Assessment

We plot communities over SA and US data in Figure 4, where colored dots represents a check-in. In figure 4, it can be observed that; (i) Overlap between communities due to data concentration in cities. (ii) CoLAB is able to capture communities across cities.

Unfortunately, we lack the community ground truth for users, making communities assessment a non-trivial task. Thus, we use a metric a joint loss function for the intra-community properties through a mixture of (i) category loss ( $\mathcal{L}_{cat}$ ) and (ii) location loss ( $\mathcal{L}_{loc}$ ). Category Loss: We consider all categories associated to locations as independent marks of a point process. Hence to estimate the category affinity in a community, we consider similarity among the check-in categories using pre-trained word embeddings [24] and devise a loss function.

\mathcal{L}_{cat}=\frac{1}{|\mathbf{T}|}\sum_{E_{n}\in\mathbf{T}}\sum_{g\in\mathcal{M}}\left\{1-\frac{\mathbf{v}_{E_{n}}\cdot\mu_{g}}{||\mathbf{v}_{E_{n}}||_{2}||\mu_{g}||_{2}}\right\}\cdot\Phi(E_{n},g)

(14)

where $\mathbf{T}$ represents test data, $\mu_{g}$ is category mean for a community $g$ using $K_{cat}$ frequent categories, $\mathbf{v}_{E_{n}}$ is the category vector for event $E_{n}$ ; $\Phi(E_{n},g)$ indicates whether $E_{n}$ is assigned to community $g$ , with $\mathcal{M}$ as all communities. Table 6 demonstrates CoLAB’s ability to capture category dynamics across communities. Although, it can be observed that at $K_{cat}$ = 10 Dirichlet Hawkes performs better because with most frequent categories like restaurant, coffee shop etc. DH assigns it most of the communities. Note that, DH is unable to capture communities with varied categories as seen at $K_{cat}$ = 50 and 100, that CoLAB performs better.
Location Loss:[4] show that users tend to visit nearby locations given we ignore the bias of loyalty. Hence ideally, a community of users not spatially dispersed in their checkin characteristics should be distinct from another community with checkins spanning large distances. We capture this through a distance based k-means loss ( $\mathcal{L}_{loc}$ ) with cluster means( $\mu_{l}$ ) for checkin coordinates for each community. In table 7 we can see CoLAB performs significantly better than other baselines because CoLAB can better capture the geographical dispersion in communities.

Table 6: Results for Category Loss

Dataset	$K_{cat}$	Sequence Mining		DH	STHP	CoLAB
		Daily	Weekly
SA	10	250.24	236.19	103.47	125.17	119.41
	50	1118.23	1089.27	862.05	842.63	826.72
	100	2007.93	2120.76	1983.61	1784.04	1749.37
US	10	248.45	217.56	98.37	118.32	113.86
	50	956.87	990.45	781.83	793.07	771.15
	100	1907.84	2020.49	1901.37	1605.64	1583.02

Table 7: Results for Location Loss

Datasets	Sequence Mining		DH	STHP	CoLAB
Datasets	Daily	Weekly
SA	600.67	547.56	413.16	306.73	298.09
US	1127.34	1067.50	1039.08	849.92	834.64

7.3.1 Qualitative Assessment

We claim that a user in a community will display an affinity towards certain categories. For US data, figure 5 shows even with highly overlapping venue categories, our model finds the intricate differences between a community with affinity to music (a) and a community with affinity towards food/bar joints. The word clouds for SA dataset shows similar properties and have been avoided for brevity.

8 Conclusion

In this paper, we presented CoLAB that uses spatio-temporal Hawkes process to infer the implicit communities using a novel stochastic variational inference technique. Empirical evaluations over synthetic as well as real-world datasets highlight its prowess with significant improvements in location and community detection tasks. This illustrates the effectiveness of the modelling used in CoLAB in generating user communities even in the absence of social connectedness information. In future work, we would like to explore scalability of CoLAB through sample-based inference techniques.

References

[1] Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
[2] Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518), 859–877 (2017)
[3] Bouros, P., Sacharidis, D., Bikakis, N.: Regionally influential users in location-aware social networks. In: SIGSPATIAL (2014)
[4] Cho, E., Myers, S.A., Leskovec, J.: Friendship and mobility: User movement in location-based social networks. In: SIGKDD (2011)
[5] Cho, Y.S., Galstyan, A., Brantingham, P.J., Tita, G.: Latent self-exciting point process model for spatial-temporal networks. vol. 19 (2014)
[6] Daley, D.J., Vere-Jones, D.: An introduction to the theory of point processes. Vol. I. Probability and its Applications (New York), Springer-Verlag, New York, second edn. (2003)
[7] Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., Song, L.: Recurrent marked temporal point processes: Embedding event history to vector. In: KDD (2016)
[8] Du, N., Farajtabar, M., Ahmed, A., Smola, A.J., Song, L.: Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In: SIGKDD (2015)
[9] Gal, Y.: Uncertainty in Deep Learning. Ph.D. thesis, University of Cambridge (2016)
[10] Gao, H., Tang, J., Hu, X., Liu, H.: Modeling temporal effects of human mobile behavior on location-based social networks. In: CIKM (2013)
[11] Gao, H., Tang, J., Liu, H.: Exploring social-historical ties on location-based social networks. In: ICWSM (2012)
[12] Hawkes, A., Oakes, D.: A cluster process representation of a self-exciting process. Journal of Applied Probability 11(3), 493–503 (1974)
[13] Hu, W., Jin, P.J.: An adaptive hawkes process formulation for estimating time-of-day zonal trip arrivals with location-based social networking check-in data. Transportation Research Part C: Emerging Technologies 79, 136 – 155 (2017)
[14] Jankowiak, M., Gomez-Rodriguez, M.: Uncovering the spatiotemporal patterns of collective social activity. In: SDM (2017)
[15] Khorasgani, R.R., Chen, J., Zaïane, O.R.: Top leaders community detection approach in information networks. In: 4th SNA-KDD workshop on Social Network mining and Analysis. Citeseer (2010)
[16] Kumpula, J.M., Kivelä, M., Kaski, K., Saramäki, J.: Sequential algorithm for fast clique percolation. Physical Review E 78(2), 026109 (2008)
[17] Lewis, P.A.W., Shedler, G.S.: Simulation of nonhomogeneous poisson processes by thinning. Naval Research Logistics Quarterly 26(3), 403–413 (1979)
[18] Li, G., Chen, S., Feng, J., Tan, K.l., Li, W.: Efficient location-aware influence maximization. In: SIGMOD (2014)
[19] Li, H., Deng, K., Cui, J., Dong, Z., Ma, J., Huang, J.: Hidden community identification in location-based social network via probabilistic venue sequences. Information Sciences 422, 188 – 203 (2018)
[20] Lichman, M., Smyth, P.: Modeling human location data with mixtures of kernel densities. In: SIGKDD (2014)
[21] Likhyani, A., Bedathur, S., P, D.: Locate: Influence quantification for location promotion in location-based social networks. In: IJCAI (2017)
[22] Likhyani, A., Padmanabhan, D., Bedathur, S., Mehta, S.: Inferring and exploiting categories for next location prediction. In: WWW (2015)
[23] Liu, J., Li, Y., Ling, G., Li, R., Zheng, Z.: Community detection in location-based social networks: An entropy-based approach. In: IEEE CIT (2016)
[24] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: LREC (2018)
[25] Noulas, A., Scellato, S., Mascolo, C., Pontil, M.: Exploiting semantic annotations for clustering geographic areas and users in location-based social networks. In: The Social Mobile Web, ICWSM Workshop (2011)
[26] Paisley, J.W., Blei, D.M., Jordan, M.I.: Variational bayesian inference with stochastic search. In: ICML (2012)
[27] Prat-Pérez, A., Dominguez-Sal, D., Larriba-Pey, J.L.: High quality, scalable and parallel community detection for large real graphs. In: Proceedings of the 23rd international conference on World wide web. pp. 225–236. ACM (2014)
[28] Reinhart, A.: A review of self-exciting spatio-temporal point processes and their applications. Statistical Science (2018)
[29] Tran, L.Q., Farajtabar, M., Song, L., Zha, H.: Netcodec: Community detection from individual activities. In: SDM (2015)
[30] Wang, X., Zhang, Y., Zhang, W., Lin, X.: Distance-aware influence maximization in geo-social network. In: ICDE (2016)
[31] Wang, Z., Zhang, D., Yang, D., Yu, Z., Zhou, X.: Detecting overlapping communities in location-based social networks. In: Social Informatics (2012)
[32] Wu, H.H., Yeh, M.Y.: Influential nodes in a one-wave diffusion model for location-based social networks. In: PAKDD (2013)
[33] Yang, S.H., Zha, H.: Mixture of mutually exciting processes for viral diffusion. In: ICML. pp. II–1–II–9. ICML’13, JMLR.org (2013), http://dl.acm.org/citation.cfm?id=3042817.3042894
[34] Ye, M., Yin, P., Lee, W.C., Lee, D.L.: Exploiting geographical influence for collaborative point-of-interest recommendation. In: SIGIR (2011)
[35] Yuan, B., Li, H., Bertozzi, A.L., Brantingham, P.J., Porter, M.A.: Multivariate spatiotemporal hawkes processes and network reconstruction. SIAM Journal on Mathematics of Data Science (2019)
[36] Zarezade, A., Jafarzadeh, S., Rabiee, H.R.: Recurrent spatio-temporal modeling of check-ins in location-based social networks. PLOS ONE 13(5), 1–20 (05 2018)
[37] Zhang, C., Shou, L., Chen, K., Chen, G., Bei, Y.: Evaluating geo-social influence in location-based social networks. In: CIKM (2012)
[38] Zhang, C., Bütepage, J., Kjellström, H., Mandt, S.: Advances in variational inference. CoRR abs/1711.05597 (2017)
[39] Zhang, P., Wang, X., Li, B.: On predicting twitter trend: Factors and models. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. p. 1427–1429. ASONAM ’13, Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2492517.2492576, https://doi.org/10.1145/2492517.2492576
[40] Zhao, F., Tung, A.K.: Large scale cohesive subgraphs discovery for social network visual analysis. Proceedings of the VLDB Endowment 6(2), 85–96 (2012)
[41] Zhou, K., Zha, H., Song, L.: Learning triggering kernels for multi-dimensional hawkes processes. In: ICML. vol. 28 (2013)
[42] Zhu, W.Y., Peng, W.C., Chen, L.J., Zheng, K., Zhou, X.: Modeling user mobility for location promotion in location-based social networks. In: SIGKDD (2015)

Modeling Implicit Communities using Spatio-Temporal Point Processes from Geo-tagged Event Traces