Birdspotter: A Tool for Analyzing and Labeling Twitter Users

Rohit Ram, University of Technology Sydney, Sydney, Australia, [email protected]
Quyu Kong, Australian National University & UTS & Data61, CSIRO, Canberra, Australia, [email protected]
Marian-Andrei Rizoiu, University of Technology Sydney & Data61, CSIRO, Sydney, Australia, [email protected]
(2021)
Abstract.

The impact of online social media on societal events and institutions is profound, and with the rapid increases in user uptake, we are just starting to understand its ramifications. Social scientists and practitioners who model online discourse as a proxy for real-world behavior often curate large social media datasets. A lack of available tooling aimed at non-data science experts frequently leaves this data (and the insights it holds) underutilized. Here, we propose birdspotter, a tool to analyze and label Twitter users, and birdspotter.ml, an exploratory visualizer for the computed metrics. birdspotter provides an end-to-end analysis pipeline, from the processing of pre-collected Twitter data to general-purpose labeling of users and estimating their social influence, within a few lines of code. The package features tutorials and detailed documentation. We also illustrate how to train birdspotter into a fully-fledged bot detector that achieves better than state-of-the-art performance without making Twitter API calls, and we showcase its usage in an exploratory analysis of a topical COVID-19 dataset.

Twitter user analysis, Bot detection, Online influence
journalyear: 2021; copyright: acmcopyright; conference: Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel; price: 15.00; doi: 10.1145/3437963.3441695; isbn: 978-1-4503-8297-7/21/03
CCS concepts: Networks: Online social networks; Human-centered computing: Social media; Information systems: Open source software

1. Introduction

Figure 1. The birdspotter.ml visualization system: Twitter users are plotted based on their user influence and botness (left panel), and we show a selected user’s profile (top-right) and cascade history (bottom-right).

Barely a decade old, social media in general, and Twitter in particular, are becoming increasingly important in shaping societal events. They serve as novel fora for a wide array of users to express themselves, discuss, promote agendas, and attempt to influence those societal events. As a result, social and political scientists, journalists, and communication scientists increasingly turn to social media as a proxy to study society. They carefully curate and label large social media datasets, and here a gap emerges: few tools are aimed at non-machine learning experts who want to analyze users in already existing datasets without making additional web API calls, which limit the amount of data that can be retrieved. This paper fills this gap by proposing birdspotter, a package aimed at non-computing practitioners with quantitative expertise (basic R or Python), to analyze, describe, and automatically label users in Twitter datasets.

Footnote 1: birdspotter source code, tutorial, and feature list: https://github.com/behavioral-ds/BirdSpotter
Footnote 2: birdspotter.ml public installation: https://www.birdspotter.ml
Footnote 3: birdspotter documentation: https://birdspotter.readthedocs.io
Footnote 4: COVID-19 tutorial: https://github.com/behavioral-ds/user-analysis
Footnote 5: Supplementary Material: https://arxiv.org/pdf/2012.02370.pdf#page=5

This work addresses three specific open questions concerning analyzing Twitter users. The first question relates to the availability of user analysis tools. Existing tools are typically designed for Twitter branding and management, i.e. to analyze either a user’s or organization’s own account (Twitter Analytics (footnote 6) or Brandwatch Consumer Research (footnote 7)), or one given user account (Account Analysis Tool (footnote 8)). The question is whether a tool exists that retrospectively analyzes and labels all the users in Twitter dumps and is aimed at non-data science experts with some computational expertise. We address this question by proposing birdspotter (footnote 1), an integrated Twitter user analysis tool that can achieve three types of analysis in only a couple of lines of code. First, it processes existing Twitter datasets (e.g. jsonl data dumps collected through the Streaming API). Second, it describes users using three types of features (relating to the user, content semantics, and hashtag usage). Last, it allows training a classifier against a labeled user subset, which turns birdspotter into a general-purpose inferential user analysis tool.

Footnote 6: Twitter Analytics: https://analytics.twitter.com/
Footnote 7: Brandwatch Consumer Research: https://www.brandwatch.com
Footnote 8: Account Analysis Tool: https://accountanalysis.app

The second open question relates to profiling user botness and influence on previously collected data. The state-of-the-art bot detector, botometer (sayyadiharikandeh2020detection), can only be accessed through its web APIs and cannot produce predictions for users that are no longer accessible, such as suspended accounts. Since bots have a high tendency of being suspended by Twitter, measuring botness a while after collecting data risks missing a large proportion of the bots involved in discussions. Similarly, existing influence estimation tools require knowledge of the social graph, which is often impossible to capture retrospectively. The question is: can we design a tool that quantifies users’ botness and influence on existing curated datasets, without the need for online API calls or supplementary information? We address this question in two ways. First, using four existing Twitter bot datasets, we train birdspotter to detect bots without requiring additional API calls. We show that birdspotter achieves a higher performance than the current state-of-the-art botometer (sayyadiharikandeh2020detection); birdspotter ships this bot detector by default with the package. Second, we implement a diffusion-based influence estimation (Rizoiu2017a), which is as accurate as using the social graph.

The third open question is whether we can visualize and explore both broad and specific views of Twitter users and their activity. We address this question by proposing birdspotter.ml (footnote 2), a tool that provides both broad views of the user population and detailed inspections of user activity (see Fig. 1 for the main interface).

The main contributions of this paper are as follows:

  • birdspotter (footnote 1): a software package designed for inferential analysis of online users in pre-collected data, and to estimate online user influence based on the reshare cascades.

  • birdspotter.ml (footnote 2): an online visualizer designed to perform exploratory analysis of Twitter users.

  • an offline bot detector, built using four public labeled datasets; we show that it achieves better than state-of-the-art performance and we showcase it on an example analysis of users discussing COVID-19 (footnote 4).

Related work. Here, we present the prior work most relevant to birdspotter. For a complete related work discussion, please refer to the online appendix (footnote 5).

Tree-based ensemble methods dominate social bot detection (over deep learning) due to the heterogeneity of bots and the relative sparsity of training data. The de-facto bot detection tool is botometer (formerly BotOrNot (botornot)), which uses more than 1000 user- and recent activity-related features to train a Random Forest classifier. The main limitations of botometer are 1) its usage of online APIs, which are rate-limited by Twitter, 2) its lack of reproducibility, since deactivated, protected, and suspended users can no longer be retrieved, and 3) the likely variation of botometer scores with user activity and botometer versioning. Birdspotter addresses the above by predicting bots on pre-collected Twitter dumps.

User influence is typically measured using static user attributes (cossu2016review), analyzing the online social graph (riquelme2016), and modeling information diffusion (zhang2019). Closest to our work is ConTinEst (Gomez-Rodriguez2016), which requires knowledge of the social graph (often prohibitively expensive to obtain) on which it performs random walks (very slow on large social graphs). Birdspotter estimates user influence from resharing dynamics in the absence of knowledge about the social graph.

2. Preliminaries

In this section, we briefly outline prerequisites concerning influence estimation using point-process models. For a thorough construction of the influence estimation, please refer to the online appendix (footnote 5).

User influence estimation. birdspotter implements the algorithm in (rizoiu2018debatenight), estimating online influence as the mean number of retweets generated, directly and indirectly, by a user’s (re)tweet. rizoiu2018debatenight estimate user influence, absent of the retweet branching structure, by assuming that retweets arrive following a Hawkes point process (Rizoiu2017a). They estimate the probability that the tweet $v_j=(m_j,t_j)$ is a direct retweet of $v_i$ as $p_{ij}=\frac{\phi(m_i,t_j-t_i)}{\sum_{k=1}^{j-1}\phi(m_k,t_j-t_k)}$, where $m_j$ is the associated user’s follower count, $t_j$ is the time of the event, and $\phi(m,\Delta t)=\kappa\theta m^{\beta}e^{-\theta\Delta t}$ is the marked Hawkes exponential kernel of parameters $\kappa$, $\beta$ and $\theta$. The pairwise influence represents the probability that $v_i$ indirectly generates $v_j$, and is computed as $r_{ij}=\sum_{k=i}^{j-1}r_{ik}p_{kj}$ when $i<j$, $r_{ii}=1$, and is $0$ otherwise. Furthermore, a tweet’s influence is the sum of its pairwise influences, and a user’s influence is its tweets’ influences averaged.
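The recursion above can be illustrated with a minimal plain-Python sketch. It is not birdspotter's internal code: the function names (kernel, tweet_influences, user_influences) and the toy cascade are hypothetical, tweets are indexed from 0, and the fallback to uniform attribution when every kernel weight is zero is an assumption of this sketch. The default parameter values are the ones quoted in Section 3.1.

```python
import math

def kernel(m, dt, kappa, beta, theta):
    """Marked exponential kernel: kappa * theta * m^beta * exp(-theta * dt)."""
    return kappa * theta * (m ** beta) * math.exp(-theta * dt)

def tweet_influences(cascade, beta=1.0, theta=6.8e-4, kappa=None):
    """cascade: list of (follower_count, time) sorted by time; item 0 is the original tweet.
    Returns the influence of every tweet in the cascade."""
    if kappa is None:
        kappa = 1.0 / theta
    n = len(cascade)
    # p[i][j]: probability that tweet j is a direct retweet of tweet i (i < j).
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):
        t_j = cascade[j][1]
        weights = [kernel(m_i, t_j - t_i, kappa, beta, theta) for m_i, t_i in cascade[:j]]
        total = sum(weights)
        if total == 0.0:
            # Assumption of this sketch: attribute uniformly if all weights vanish.
            weights, total = [1.0] * j, float(j)
        for i in range(j):
            p[i][j] = weights[i] / total
    # r[i][j]: pairwise influence of tweet i on tweet j, with r[i][i] = 1.
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][i] = 1.0
        for j in range(i + 1, n):
            r[i][j] = sum(r[i][k] * p[k][j] for k in range(i, j))
    # A tweet's influence is the sum of its pairwise influences.
    return [sum(r[i][i:]) for i in range(n)]

def user_influences(user_ids, influences):
    """Average the tweet influences per user."""
    per_user = {}
    for uid, phi in zip(user_ids, influences):
        per_user.setdefault(uid, []).append(phi)
    return {uid: sum(v) / len(v) for uid, v in per_user.items()}

# Toy cascade: (followers, seconds since the original tweet).
toy = [(100, 0.0), (10, 60.0), (1000, 120.0)]
print(tweet_influences(toy))
```

Because each retweet's attribution probabilities sum to one, the original tweet's influence equals the cascade size, and the last retweet's influence is 1 (only itself).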

3. Package Overview

In this section, we give an overview of birdspotter and birdspotter.ml, and describe their usage, functionalities, and design.

Figure 2. (a) Mean AUC +/- standard deviation, varying ablated models and botometer. Models/Features are indicated by BS (birdspotter), BT (botometer), HT (Hashtags), SM (Semantic), and TU (Twitter User). (b) Mean $F_1$ score versus bot threshold for birdspotter and botometer. (c) SHAP summary plot where points indicate classifier decisions, y-axis shows features in decreasing importance, x-axis shows SHAP impact value, and color indicates feature value. Positive SHAP indicates bots.

3.1. birdspotter

birdspotter labels users and measures influence on previously collected tweets in the standard jsonl or json format.

Measuring influence. birdspotter measures user influence as outlined in Section 2, using by default a marked Hawkes exponential kernel with parameters $\beta=1$, $\kappa=\frac{1}{\theta}$ and $\theta=6.8\times 10^{-4}$. These were tuned on a large collection of real cascades (rizoiu2018debatenight), and can be customized using the function getInfluenceScores().
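As a worked illustration of these defaults: since $\kappa\theta=1$, the kernel of Section 2 reduces to the follower count multiplied by an exponential decay, and the weight of a past tweet halves every $\ln 2/\theta$ time units. If event times are measured in seconds (an assumption made here for illustration), this half-life is roughly 17 minutes.

```latex
\phi(m,\Delta t)=\kappa\theta\, m^{\beta}e^{-\theta\Delta t}
\quad\text{with } \beta=1,\ \kappa=\tfrac{1}{\theta}
\quad\Longrightarrow\quad
\phi(m,\Delta t)=m\,e^{-\theta\Delta t},
\qquad
\Delta t_{1/2}=\frac{\ln 2}{\theta}\approx\frac{0.693}{6.8\times 10^{-4}}\approx 1019 .
```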

Usage and functionalities. Given a dataset of tweets collected externally (e.g. leveraging the Twitter Filter API), birdspotter’s core functionality revolves around two steps. In the first step, birdspotter loads the Twitter dataset, extracts retweet cascades, and compiles the user-level information. In the second step, it performs the influence analysis and user labeling. The former is achieved by simply invoking the BirdSpotter constructor, while the latter is achieved by calling the function getLabeledUsers(), which returns a table with the user features detailed above. For every observed cascade, birdspotter also computes the most likely branching structure (see $p_{ij}$ in Section 2). This can be achieved using the function getCascadesDataFrame(), which returns the reshare cascades (i.e. original tweet and all its retweets) with the additional column expected_parent indicating a retweet’s most likely parent tweet.
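One plausible reading of a retweet's "most likely parent" is the earlier tweet with the largest attribution probability. The sketch below illustrates that reading in plain Python; it is an assumption about the selection rule rather than birdspotter's internal code, and the function name expected_parents and the toy cascade are hypothetical.

```python
import math

def expected_parents(cascade, beta=1.0, theta=6.8e-4):
    """cascade: list of (follower_count, time) sorted by time; item 0 is the original tweet.
    Returns, for each tweet, the index of its most likely parent (None for the original)."""
    if not cascade:
        return []
    parents = [None]
    for j in range(1, len(cascade)):
        t_j = cascade[j][1]
        # The shared factor kappa * theta and the normalizing sum do not change the
        # argmax, so it is enough to compare m_i^beta * exp(-theta * (t_j - t_i)).
        scores = [(m_i ** beta) * math.exp(-theta * (t_j - t_i)) for m_i, t_i in cascade[:j]]
        parents.append(max(range(j), key=scores.__getitem__))
    return parents

# Toy cascade: the third tweet comes from a user with many followers,
# so it becomes the most likely parent of the fourth tweet.
toy = [(100, 0.0), (10, 60.0), (100000, 120.0), (5, 180.0)]
print(expected_parents(toy))
```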

For power users, birdspotter provides a number of configuration options (such as changing the parameters of the Hawkes kernel or using user-defined word embeddings), documented on readthedocs (footnote 3). A usage tutorial is available in birdspotter’s repository (footnote 1). For users who prefer to analyze the results outside Python, birdspotter can dump the user table and the reshare cascades to Comma Separated Values (CSV) files, which can be loaded in other tools. All birdspotter functionalities can be accessed in R via reticulate (https://github.com/rstudio/reticulate).

Feature Construction. birdspotter constructs user features (footnote 1) in three categories: Twitter user, semantic, and topic-based features. Twitter user features are engineered directly from Twitter user attributes and capture heuristics of common bot behavior. Semantic features are constructed (by default) from FastText 300d word2vec embeddings (mikolov2018advances) of users’ tweet content and descriptions. Content embeddings are averages of tweet embeddings, which are averages of word embeddings. Topic-based features are the vectors of the 1,000 most frequent hashtags, scored for each user using the term frequency-inverse document frequency scheme. birdspotter is designed to be easily extended with any arbitrary (numerical) features to allow for rapidly evolving bot strategies (yang2019arming).
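The two-level averaging and the hashtag scoring can be sketched in plain Python as follows. This is an illustration only, not birdspotter's feature code: the function names are hypothetical, the word vectors are a toy dictionary standing in for the 300-dimensional embeddings, and the TF-IDF variant used (raw count times log(N/df)) is one common choice assumed here.

```python
import math
from collections import Counter

def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def content_embedding(tweets, word_vectors):
    """tweets: list of token lists. Average word vectors per tweet, then average the tweets."""
    tweet_vectors = []
    for tokens in tweets:
        known = [word_vectors[w] for w in tokens if w in word_vectors]
        if known:
            tweet_vectors.append(average(known))
    return average(tweet_vectors) if tweet_vectors else None

def hashtag_tfidf(user_hashtags, vocab_size=1000):
    """user_hashtags: dict mapping a user to the list of hashtags they used (with repeats).
    Returns the hashtag vocabulary and one TF-IDF vector per user."""
    counts = {user: Counter(tags) for user, tags in user_hashtags.items()}
    totals = Counter()
    for c in counts.values():
        totals.update(c)
    vocab = [tag for tag, _ in totals.most_common(vocab_size)]
    n_users = len(counts)
    # Document frequency: number of users who used the hashtag at least once.
    doc_freq = {tag: sum(1 for c in counts.values() if tag in c) for tag in vocab}
    idf = {tag: math.log(n_users / doc_freq[tag]) for tag in vocab}
    return vocab, {user: [c[tag] * idf[tag] for tag in vocab] for user, c in counts.items()}

vocab, scores = hashtag_tfidf({"a": ["covid", "covid", "news"], "b": ["covid"], "c": ["sports"]})
print(vocab, scores["a"])
```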

User labeling. birdspotter implements a supervised labeler. It engineers a large selection of features, and it uses a Gradient Boosting Machine model (XGBoost (xgboost) implementation), with hyperparameters tuned via Random Search and 5-fold cross-validation.

Beyond bot prediction. birdspotter ships with a pre-trained bot classifier by default (see Section 4); however, it can be customized to a particular application or dataset through labeling and re-training. The function getBotAnnotationTemplate() outputs a CSV that can be annotated by the user, and trainClassifierModel() re-trains the classifier with this annotated data. An option controls whether the model is further tuned starting from the existing model (useful for adapting bot detection to a particular dataset) or retrained from scratch. We exemplify this in Section 4.

Data Structures. birdspotter’s main class, called BirdSpotter, is used to access methods and attributes.

birdspotter makes accessible three pandas dataframes through the main object after processing: featuresDataframe (users and their extracted features), cascadeDataframe (tweets and cascade information), and hashtagDataframe (TF-IDF of hashtags).

Performance. birdspotter performed the extraction, processing, and profiling of a dataset of 196,269 tweets and 129,778 users in just 5.7 ms per tweet on an Intel Xeon W-2145 CPU.

Installation. birdspotter installs in the canonical Python way: pip install birdspotter.

Figure 3. Quantifying user botness and influence analysis on COVID-19 dataset. (a) Code required to load a Twitter dump, generate cascade and user information, annotate and fine-tune the bot classifier. (b) A density plot of user botness scores, and complementary cumulative density plots (CCDF) of user influence and user activity. The red lines show the mean values.

3.2. birdspotter.ml visualiser

birdspotter.ml (footnote 2) is a visualizer built on top of birdspotter, and designed to analyze Twitter users engaged in online discussions. The visualization provides both broad and specific views of the data, via the three components shown in Fig. 1: a scatter plot component, a user information component, and a cascade view component.

The Scatter Plot. The left panel contains the scatter-plot showing the influence percentile (on the y-axis) and botness (on the x-axis) of a random sample of the users from the dataset, and the underlying 2-D density over the entire data set. Users are colored based on the hashtag they use most and, when selected, the user and cascade views are populated. The plot is pan-able and zoom-able. The view starts with a random sample of 1,000 users and is dynamically populated as practitioners explore cascades.

The User View. The top-right panel provides information about a selected user, including their Twitter image (hyperlink to the user’s profile), screen name, location, the hashtags they used, and basic Twitter metrics (such as the number of followers or tweets).

The Cascade View. The bottom-right panel shows the cascades the selected user participates in, which are select-able via a carousel. The component shows the text of the original tweet, the retweets’ timing, and the most likely branching structure inferred using birdspotter. The points on this component are select-able and hover-able in the same way as in the scatter plot. The component is also pan-able and zoom-able.

4. Building a bot detector

In this section, we train birdspotter as a bot classifier that outperforms the state-of-the-art botometer. We then use birdspotter to profile a topical COVID-19 Twitter dataset.

Training data. birdspotter provides the functionality to retrain and update the current model, which we leverage to build a bot detector. We train on four public bot datasets, namely {botometer-feedback-2019, political-bots-2019} (yang2019arming) and {verified-2019, botwiki-2019} (yang2020scalable), sourced from the Bot Repository (footnote 9: available from https://botometer.osome.iu.edu/bot-repository/datasets.html).

Training. The Bot Repository only provides account-level data, whereas birdspotter is designed to utilize tweet jsonl. We use the tool twarc to acquire the timeline of each available user’s first 200 tweets, to construct jsonl training data. We extract and preprocess the data with BirdSpotter(), label the resulting dataframe with users’ ground truth values, and run trainClassifierModel() on this training data to acquire our final model. We ship this model as the default at birdspotter’s installation.

Botometer comparison. We compare the derived model against botometer by acquiring its bot scores (universal CAP (yang2019arming)) for available users through its API. Fig. 2(a) shows that birdspotter outperforms botometer in terms of mean AUC, despite using less information to make predictions (botometer uses more user features extracted from the online API). Fig. 2(b) shows that birdspotter consistently outperforms botometer with respect to mean $F_1$ scores, over all bot score thresholds.
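A threshold sweep of the kind summarized in Fig. 2(b) can be sketched as follows. This is a plain-Python illustration on made-up scores and labels, not the evaluation code or data behind the figure; the function names are hypothetical, and a user is assumed to be predicted a bot when the score is at or above the threshold.

```python
def f1_at_threshold(scores, labels, threshold):
    """labels: 1 for bot, 0 for human. Predict bot when score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        # No true positives: F1 is defined as 0 here to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_curve(scores, labels, thresholds):
    """F1 score at each candidate bot threshold."""
    return [(t, f1_at_threshold(scores, labels, t)) for t in thresholds]

# Toy bot scores and ground-truth labels (illustrative values only).
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_labels = [1, 0, 1, 0]
for threshold, f1 in f1_curve(toy_scores, toy_labels, [0.3, 0.5, 0.7]):
    print(threshold, round(f1, 3))
```

Averaging such curves over cross-validation folds would give a mean $F_1$ per threshold for each detector.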

Ablation study. We test the importance of each set of features through various ablations of our main model. Fig. 2(a) shows the mean AUC obtained for subsets of features. It shows that Twitter user features and semantic features are both informative of bot-like behavior, while hashtag features show more variation. The hashtag model performance may be an artifact of training on the mixture of bot datasets (containing hashtags relating to different topics). We retain hashtag features in birdspotter, for better generalizability when users train and test on their own domain datasets. The best performing model uses Twitter user features and semantic features.

SHAP analysis. We use shap (NIPS2017_7062) for explaining the impact of features in our tree ensemble model. Fig. 2(c) shows that the Twitter user features form the majority, and semantic features a minority of the impactful features, in line with the ablation study.

COVID-19 Application Dataset. We apply birdspotter to a COVID-19 dataset (chen2020tracking), supplied as tweet IDs which were re-hydrated with twarc to a jsonl format, recovering $68.8\%$. We limit our analysis to the $\sim 1.5$M unique tweets relating to posts on January the 31st, resulting in $\sim 0.28$M users and $\sim 0.42$M cascades.

Dataset profiling. Fig. 3(b) shows the empirical distributions of botness, influence, and activity (i.e., the number of cascades a user participates in). The distribution of botness has two maxima: the larger corresponds to the humans and the smaller to the bots. Conforming with the literature (rizoiu2018debatenight), influence and activity are long-tailed (following a “rich-get-richer” paradigm).

(Re-)Labeling Users. Exploring birdspotter.ml, we observe humans (@DumplingSays, @eddfuentess, and @marat_dospolov) with bot scores of 0.873, 0.83, and 0.925, respectively. Using getBotAnnotationTemplate() (see Fig. 3(a), line 8), we label each user as human, and update the classifier with trainClassifierModel() (Fig. 3(a), line 10). The new bot scores are 0.375, 0.296, and 0.559, respectively. Practitioners can use birdspotter to classify any latent user attribute.

5. Conclusion

We presented birdspotter, a Twitter user analysis tool aimed at non-data science experts who analyze discourse and user activity on social media. It provides an end-to-end analysis of users’ online characteristics, and populates a visualizer facilitating both broad views of a user population and individual exploration. As with many open-source classifiers, we know that birdspotter could be leveraged to infer sensitive features. However, we are currently not aware of any protections that we could implement to prevent this.

Tools like birdspotter are integral to the timely, performant, and reproducible analysis of social media users for understanding discourse and society.

Acknowledgments

This research was partially funded by Facebook Research under the Content Policy Research Initiative grants and the Defence Science and Technology Group of the Australian Department of Defence.

References

Accompanying the submission Birdspotter: A Tool for Analyzing and Labeling Twitter Users.

Appendix A Additional Related Work

In this section, we outline other approaches to bot detection and influence measurement in the literature.

Detecting Twitter bots. There have been a myriad of approaches to detect bots on Twitter, which fall into three motifs. The first comprises supervised methods used to determine whether an individual user is a bot, usually employing feature construction. Such approaches include NLP approaches (knauth2019; clark2016), deep-learning approaches (kudugunta2018), feature engineering (chu2013; yang2014; botornot), and other methods (mazza2019; ferrara2016). The second comprises unsupervised methods used to discover coordinated online behavior and real-time online campaigns, and the third comprises adversarial methods, which achieve better bot detection by generating better bots.

birdspotter falls in the first category. It uses a supervised approach to retrospectively analyze datasets. It satisfies a different use case than coordinated online behavior tools like BotSlayer (botslayer). Adversarial approaches are fairly novel; however, it is unclear whether they might simply improve bot technology, as they provide recipes to build better bots.

The de-facto bot detection tool in the social science community is Botometer (formerly BotOrNot (botornot)), which uses more than 1000 user- and recent activity-related features to train a Random Forest classifier. At the time of writing, Botometer is at version 4 and serves half a million queries a day (sayyadiharikandeh2020detection).

The main limitation of botometer for practitioners is its dependence on an online API. It cannot be used to profile the users in offline Twitter datasets which have been collected in the past (as used in many works (wojcik2018; bessi2016; ferrara2020)). Furthermore, the API is rate-limited by Twitter, and requires registration through both Twitter and the RapidAPI service. For scientific purposes, botometer makes local reproducibility difficult to achieve, since deactivated, protected, and suspended users can no longer be retrieved, and botometer scores are likely to vary with user activity and botometer versioning.

Birdspotter addresses the above-stated shortcomings by producing bot predictions on already collected Twitter dumps, and exposing a simple interface that allows researchers to annotate their own Twitter user collection.

Tools for quantifying online influence. There are many features used to score the influence, reputation or popularity of online users. We delineate these into three areas: those using static user attributes (including lexical features and information on a user’s profile) (cossu2016review), those that analyze the online social graph (e.g. degree, PageRank, HITS, etc.) (riquelme2016; cha2010), and those modeling information diffusion (zhang2019). However, few of these have translated into accessible tools for non-experts in the field. For instance, cossu2016review provide a set of scripts to perform their influence measurement method. Other tools, like ConTinEst (Du2013; Gomez-Rodriguez2016), require knowledge of the social graph (which is often prohibitively expensive to obtain), on which they perform random walks (which are very slow on large social graphs). Birdspotter estimates user influence from reshare dynamics, in the absence of knowledge about the social graph, and provides an end-to-end tool to analyze Twitter users.

Appendix B Influence measure

We review the theoretical prerequisites concerning modeling reshare cascades using point processes, and estimating reshare influence.

Reshare cascades. birdspotter analyzes the spread of online information in the form of online reshare cascades. A reshare cascade consists of an initial user post and some reshare events of the post by other users. On Twitter, for example, this can happen when users use the retweet functionality. We denote a cascade observed up to time $T$ as $\mathcal{H}(T)=\{t_0,t_1,\dots\}$, where $t_i\in\mathcal{H}(T)$ are the event times relative to the first event ($t_0=0$). We denote cascades with additional information about events (dubbed here as event marks) as marked cascades. We use the notation $\mathcal{H}_m(T)=\{(t_0,m_0),(t_1,m_1),\dots\}$, where each event is a tuple of the event time and the event mark. For example, for retweet cascades, the numbers of followers of a Twitter user are commonly adopted as event marks (Zhao2015SEISMIC:Popularity; Mishra2016FeaturePrediction; Mishra2018ModelingPopularity).
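As a small illustration of this notation, the sketch below turns raw (timestamp, follower count) pairs into a marked cascade with times relative to the first event. The function name to_marked_cascade and the toy values are hypothetical, and timestamps are assumed to be numeric (e.g. Unix seconds).

```python
def to_marked_cascade(events):
    """events: iterable of (timestamp, follower_count) pairs in any order.
    Returns [(t_i, m_i), ...] sorted by time, with t_0 = 0."""
    ordered = sorted(events)
    if not ordered:
        return []
    t0 = ordered[0][0]
    return [(t - t0, m) for t, m in ordered]

# Toy input: one original tweet and two retweets, given out of order.
raw = [(1580428860, 10), (1580428800, 100), (1580428920, 1000)]
print(to_marked_cascade(raw))
```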

The Hawkes processes. birdspotter models reshare cascades using Hawkes processes (hawkes1971spectra), a type of point process with the self-exciting property, i.e., the occurrence of past events increases the likelihood of future events. The occurrence of events in a Hawkes process is controlled by the event intensity function:

(1) $\lambda(t\mid\mathcal{H}(T))=\mu(t)+\sum_{t_i<t}\phi(t-t_i)$

where $\mu(t)$ is the background intensity function and $\phi:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+}$ is a kernel function capturing the decaying influence from a historical event. We note that, for reshare cascades, all events are considered to be offspring of the initial event, i.e. there is no background event rate: $\mu(t)=0$. Two widely adopted parametric forms for the kernel function $\phi$ include the exponential function $\phi_{EXP}(t)=\kappa\theta e^{-\theta t}$ and the power-law function $\phi_{PL}(t)=\kappa(t+c)^{-(1+\theta)}$.
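The intensity in Eq. (1) and the two kernels can be evaluated directly, as in the following plain-Python sketch. The function names and parameter values are illustrative assumptions, and the background rate is taken as a constant that defaults to zero, as for reshare cascades.

```python
import math

def phi_exp(dt, kappa, theta):
    """Exponential kernel: kappa * theta * exp(-theta * dt)."""
    return kappa * theta * math.exp(-theta * dt)

def phi_pl(dt, kappa, theta, c):
    """Power-law kernel: kappa * (dt + c)^-(1 + theta)."""
    return kappa * (dt + c) ** (-(1.0 + theta))

def intensity(t, event_times, kernel, mu=0.0):
    """Eq. (1): mu plus the kernel contributions of all events strictly before t."""
    return mu + sum(kernel(t - t_i) for t_i in event_times if t_i < t)

# Toy cascade with events at times 0 and 1; parameter values are illustrative.
events = [0.0, 1.0]
print(intensity(2.0, events, lambda dt: phi_exp(dt, kappa=0.8, theta=1.0)))
print(intensity(2.0, events, lambda dt: phi_pl(dt, kappa=0.8, theta=1.0, c=1.0)))
```

Each past event adds a decaying term, so the intensity jumps after every event and relaxes between events.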

Marked Models. birdspotter implements marked versions of the point processes, where the mark is the number of followers that the user emitting the tweet has. This is because the mark of each event governs the number of future events, e.g., a tweet from a largely followed user is likely to attract more retweets. The marked versions of Hawkes processes (Mishra2016FeaturePrediction) are then derived by rescaling the kernel functions with the marks, i.e., $\phi(m,t)=m^{\beta}\phi(t)$; $\beta$ controls the warping effect of the mark.

User influence estimation. birdspotter adopts the following definition for user influence, widely used in literature (Du2013; Zarezade2017; rizoiu2018debatenight):

Definition B.1.

Online user influence $\varphi(u)$ is defined as the mean number of reshares generated directly and indirectly by a message posted by $u$, irrespective of whether it is an original message or a reshare.

Estimating influence from retweet cascades has the additional difficulty of not observing the branching structure of the diffusion, i.e., the Twitter API attributes all retweets to the original tweet. birdspotter estimates Twitter user influence using only the observed retweet cascade $\mathcal{H}_m(T)=\{v_0=(t_0,m_0),v_1=(t_1,m_1),\dots\}$, where marks correspond to users’ number of followers.

rizoiu2018debatenight propose a method to estimate user influence in the absence of the branching structure by assuming that retweets arrive following a Hawkes point process (Rizoiu2017a). We can quantify the probability that an event $v_j$ is generated by a previous event $v_i$ as the ratio of the event intensity generated by $v_i$ and the total intensity at time $t_j$. Formally, the probability that $v_j$ retweets $v_i$ is

(2) $p_{ij}=\frac{\phi(t_j-t_i)}{\sum_{k=1}^{j-1}\phi(t_j-t_k)}$

rizoiu2018debatenight also introduce the pairwise influence score $m_{ij}$, intuitively defined as the amount of influence that $v_i$ exerts over $v_j$ either directly (when $v_j$ is a direct retweet of $v_i$) or indirectly (when $v_j$ is a retweet of a descendant of $v_i$):

(3) $m_{ij}=\begin{cases}\sum_{k=i}^{j-1}m_{ik}p_{kj}&,i\leq k<j\\ 1&,i=j\\ 0&,i>j\end{cases}$

Finally, the influence of $v_i$ is $\varphi(v_i)=\sum_{k=i}^{n}m_{ik}$, and the influence of a user $u$ is the average of the influences of all of their tweets:

(4) $\varphi(u)=\frac{\sum_{v\in\mathcal{T}(u)}\varphi(v)}{|\mathcal{T}(u)|}$

where $\mathcal{T}(u)$ is the set of all the tweets emitted by user $u$.
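As a worked illustration of Eqs. (3) and (4), consider a toy cascade of three tweets $v_1, v_2, v_3$. The attribution probabilities below are assumed values chosen for the example, not estimates from data; $p_{12}=1$ because $v_1$ is the only candidate parent of $v_2$.

```latex
% Assumed attribution probabilities: p_{12} = 1, p_{13} = 0.25, p_{23} = 0.75.
\begin{align*}
m_{11} &= 1, & m_{12} &= m_{11}p_{12} = 1, & m_{13} &= m_{11}p_{13} + m_{12}p_{23} = 0.25 + 0.75 = 1,\\
m_{22} &= 1, & m_{23} &= m_{22}p_{23} = 0.75, & m_{33} &= 1,\\
\varphi(v_1) &= 1 + 1 + 1 = 3, & \varphi(v_2) &= 1 + 0.75 = 1.75, & \varphi(v_3) &= 1.
\end{align*}
% If user u posted v_2 and v_3, Eq. (4) gives
\[
\varphi(u) = \frac{\varphi(v_2) + \varphi(v_3)}{2} = \frac{1.75 + 1}{2} = 1.375 .
\]
```

The original tweet's influence equals the cascade size, since the attribution probabilities of each retweet sum to one.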
