Tweet Insights: A Visualization Platform
to Extract Temporal Insights from Twitter
Abstract
This paper introduces a large collection of time series data derived from Twitter, postprocessed using word embedding techniques and specialized fine-tuned language models. The data covers the past five years and captures changes in n-gram frequency, similarity, sentiment, and topic distribution. The interface built on top of this data enables temporal analysis for detecting and characterizing shifts in meaning, including information complementary to trending metrics, such as sentiment and topic association over time. We release an online demo for easy experimentation, and we share the code and the underlying aggregated data for future work. We also discuss three case studies unlocked by our platform, showcasing its potential for temporal linguistic analysis.
1 Introduction
Social media platforms are fundamental tools in today’s society, and Twitter, in particular, has emerged as a thermometer for modern communication. While numerous research studies have targeted Twitter for societal investigations Zhuravskaya et al. (2020); Wohn and Na (2011); Lim et al. (2015), analysing it at a large scale and over a prolonged period is not an easy task. It requires both the compilation of a large general-purpose corpus and the use of cutting-edge NLP techniques for data analysis. Moreover, the noisiness and informal nature of the platform make this task especially challenging Baldwin et al. (2013); Morgan and Van Keulen (2014).
From a linguistic perspective, social media offer a valuable corpus for analysing word meaning change over time Del Tredici et al. (2019); Loureiro et al. (2022b), alongside tracking changes related to word usage. Several factors can influence language and topic discussions on Twitter, such as the release of a new movie (e.g., Parasite), the emergence of a new meaning for a certain word (e.g., delta), or an event that causes a sudden change in the sentiment toward a particular entity (e.g., Alec Baldwin).
In this paper, we present Tweet Insights, a platform for analysing Twitter textual data from a temporal perspective. Tweet Insights is released as a set of pre-computed time series and an online demo (a short video showcasing the demo is available at https://tinyurl.com/ycxwh9mu) for analysing historical multi-faceted word-based statistics in real time. With Tweet Insights, non-expert users can gather temporal statistics for word frequency, embedding similarity, sentiment, and topics.
2 Related Work
Numerous studies have focused on using Twitter as a data source in Computational Social Science research, including statistical tools helping in this regard O’Connor et al. (2010). Recently, Pfeffer et al. (2023) analysed all tweets posted on the social platform during a particular day (375M), with the goal of better understanding bot activity and user demographics, among other aspects. Most similar to our work, Storywrangler Alshaabi et al. (2021) continuously collects a large sample of tweets and provides several visualizations related to word frequency through an online interface.
While there are commercial tools at the intersection of advanced social media monitoring, decision-making, and market intelligence (e.g., Bloomberg Terminal Dredze et al. (2016), FactSet, or Thomson Reuters Eikon), these applications are often restricted to the financial domain and, most importantly, are not open-source and have a high cost (a Bloomberg Terminal license, for example, costs around USD 25,000). To our knowledge, Hedonometer Dodds et al. (2015) is the only non-commercial documented tool analysing aspects complementary to popularity at scale, namely, tracking positive sentiment on the platform. In terms of language evolution, the research community has notably used Twitter to analyse meaning shift using distributional methods Mitra et al. (2014); Kulkarni et al. (2015), although at a smaller scale than the previously mentioned works. As such, our work is the first attempt at deriving deeper insights at scale from this social platform, leveraging state-of-the-art language models and embedding methods to provide a more holistic understanding beyond surface-level count-based metrics.
3 Tweet Insights Development: Data and Models
This work proposes a set of four time series designed for exploring meaning shifts through different facets, among other potential applications. In this section, we describe the data and methods used to derive each of them.
3.1 Corpus
We use the Twitter Academic API to collect a total of 220M English tweets covering the period between the start of 2018 and the end of 2022. We follow the process described in Loureiro et al. (2022a) to compile a representative sample of tweets – ignoring retweets and media tweets, and requiring stopwords in the tweet’s text to increase the likelihood of sampling tweets with meaningful content. This process results in a collection of tweets distributed uniformly across different time scales (month, day, and hour), with the exception that more recent months sample additional intervals (see Appendix A for more details). Tweets are preprocessed by anonymizing mentions of non-verified users (considering only ‘legacy’ verified users, collected on 2022-11-09) and removing URLs.
Word selection.
Once the corpus is compiled, our goal is to select a representative set of words from it. To this end, we tokenize tweets using NLTK’s TweetTokenizer Bird et al. (2009) and recognize n-grams using the statistical scoring method proposed by Mikolov et al. (2013), through the implementation available in Gensim v4.1.2 Řehůřek and Sojka (2010), recognizing bigrams and trigrams with parameters max_vocab_size=100M, min_count=10, and threshold=10, while ignoring connector words. For methods that rely on co-occurrence metrics, namely n-gram recognition (and associated frequencies) and learning diachronic word embeddings, we use lowercased texts. See Table 1 for more details about corpus size and a breakdown by token type. While casing is preserved for classification methods, the resulting statistics are aggregated over lowercased n-grams, for improved consistency across time series and given the informal nature of Twitter.
| Statistic | Count |
|---|---|
| Tweets | 220,829,111 (220M) |
| Tokens | 4,186,915,103 (4B) |
| Unigram types | 321,073 |
| Bigram types | 123,604 |
| Trigram types | 12,946 |
| Hashtag types | 170,138 |
| Username types | 96,639 |

Table 1: Corpus size and breakdown by token type.
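To illustrate the n-gram recognition step, below is a minimal sketch of the phrase-scoring approach of Mikolov et al. (2013) in plain Python (a simplified stand-in for Gensim's implementation; the corpus, parameter values, and function names here are illustrative):

```python
from collections import Counter

def phrase_score(pair_count, a_count, b_count, vocab_size, delta):
    """Mikolov et al. (2013) scoring: rewards pairs that co-occur often
    relative to their individual counts; delta discounts rare pairs."""
    return (pair_count - delta) * vocab_size / (a_count * b_count)

def detect_bigrams(sentences, min_count=10, threshold=10):
    """Return the word pairs scoring above the threshold (candidate bigrams)."""
    unigrams, pairs = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b)
        for (a, b), n in pairs.items()
        if n >= min_count
        and phrase_score(n, unigrams[a], unigrams[b], vocab_size, min_count) > threshold
    }
```

In our setup, bigrams are detected first and the process is repeated over the rewritten corpus to capture trigrams.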
Language models.
As a byproduct of our demo, we also release two RoBERTa language models, base (https://huggingface.co/cardiffnlp/twitter-roberta-base-2022-154m) and large (https://huggingface.co/cardiffnlp/twitter-roberta-large-2022-154m), trained on the collected Twitter API data. These models are trained on a sample of 154M tweets covering the period between 2018-01 and 2022-12. This sample results from filtering our corpus of 220M tweets, following the preprocessing procedure described in Loureiro et al. (2022a), which removes near-duplicates and tweets from the most popular 1% of users (i.e., potential bots). For training, we followed the same procedure as Barbieri et al. (2020), training from the original checkpoints until convergence on a validation set of 50k tweets.
3.2 Time series
Once the reference corpus is compiled, we split the data into monthly intervals starting from January 2020. Data for the months between 2018-01 and 2019-12 is aggregated into a single period (i.e., ‘up to 2020’) to establish prior baselines. While Section 4 provides details of our main online visualization platform, we also provide an easy-to-use Colab notebook (https://tinyurl.com/8pvb7wvb) with code to access the aggregated time series data programmatically and create custom data visualizations. In the following, we describe the four main aggregated time series available in Tweet Insights: frequency, embedding distance, sentiment, and topic distribution.
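The monthly bucketing with an aggregated prior can be sketched as follows (a minimal helper under our own assumptions about the timestamp format; the function name is illustrative):

```python
from datetime import datetime

def period_label(timestamp: str) -> str:
    """Map a tweet timestamp to its time-series interval: all months
    before 2020-01 collapse into the single 'up to 2020' prior period,
    while later tweets map to their own monthly interval."""
    dt = datetime.fromisoformat(timestamp)
    if dt.year < 2020:
        return "up to 2020"
    return f"{dt.year}-{dt.month:02d}"
```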
3.2.1 Frequency
The frequency time series is based on n-gram counts obtained from the full collection of 220M tweets. The frequency of each word in the vocabulary is aggregated monthly. Bigram and trigram frequencies add to their respective unigram components (e.g., an occurrence for Joe Biden also increments counts for Joe and Biden). Our resource includes both absolute and normalized (words per million) counts.
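The counting behaviour described above can be sketched with two simple helpers (our own simplification; names are illustrative):

```python
from collections import Counter

def update_counts(counts: Counter, ngram: str) -> None:
    """Increment an n-gram's count; bigrams and trigrams also increment
    their unigram components (e.g., 'joe biden' adds to 'joe' and 'biden')."""
    counts[ngram] += 1
    parts = ngram.split(" ")
    if len(parts) > 1:
        for word in parts:
            counts[word] += 1

def per_million(counts: Counter, total_tokens: int) -> dict:
    """Normalize absolute monthly counts to words per million."""
    return {w: c / total_tokens * 1_000_000 for w, c in counts.items()}
```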
3.2.2 Embedding distance
In order to compare embedding similarity for the same n-gram at different time periods, we employ the TWEC method proposed by Di Carlo et al. (2019) to efficiently learn diachronic word embeddings (based on word2vec embeddings, 300 dimensions, min_count=10) for every month covered in our corpus. As done for training the RoBERTa language models (described in Section 3.1), this step applies the same preprocessing to improve the resulting embeddings. Since TWEC starts by learning a so-called ‘compass’ set of embeddings from the concatenation of text from all periods, we undersample more recent months (which have additional data) in order to learn compass embeddings from a corpus that is balanced across monthly intervals. The atemporal compass embeddings are then used to initialize embeddings for particular time periods, which are tuned by re-training on data from the corresponding temporal interval.
The time series resulting from these embeddings features the cosine distance between the embedding from the first month with available data (the embedding prior, corresponding to ‘up to 2020’ for most n-grams), and embeddings corresponding to each subsequent month. Consequently, there are no cosine distances to report for the first months with available data.
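The resulting series can be computed from the learned vectors with a simple cosine-distance helper (a plain-Python sketch; the monthly embeddings themselves come from the TWEC training step above):

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_series(prior_vec, monthly_vecs):
    """Distance of each month's embedding for an n-gram to its prior
    embedding (the first month with available data, which is the
    'up to 2020' embedding for most n-grams)."""
    return {month: cosine_distance(prior_vec, vec)
            for month, vec in monthly_vecs.items()}
```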
3.2.3 Sentiment
We use the Twitter-trained RoBERTa-base language model and fine-tune it for sentiment classification to obtain negative, neutral, and positive scores for every tweet in our corpus. Taking the SemEval-2017 sentiment analysis dataset as a reference Rosenthal et al. (2017), the model achieves 73.7 macro-averaged recall and outperforms other general-purpose models Camacho-Collados et al. (2022), while remaining small enough to run millions of inferences. Since our goal is to provide the most accurate predictions, for inference we use a model fine-tuned on all splits of the SemEval dataset.
With predictions for all tweets, we compile our sentiment time series as the mean prediction scores for tweets where a particular n-gram occurs. To make comparisons more reliable across monthly intervals, for each n-gram, we set a minimum threshold of 10 tweets per month and compute the mean scores considering a sample of up to 1,024 tweets per month (depending on availability).
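This aggregation can be sketched as follows (our own simplification; `preds` is assumed to hold per-tweet score dictionaries from the classifier, and the sampling strategy shown is illustrative):

```python
import random

def monthly_sentiment(preds, min_tweets=10, max_sample=1024, seed=0):
    """Mean sentiment scores over the tweets mentioning an n-gram in one
    month; months below the minimum-tweet threshold report no value."""
    if len(preds) < min_tweets:
        return None
    rng = random.Random(seed)
    sample = preds if len(preds) <= max_sample else rng.sample(preds, max_sample)
    return {label: sum(p[label] for p in sample) / len(sample)
            for label in ("negative", "neutral", "positive")}
```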

3.2.4 Topic distribution
We obtain topic distributions for every tweet in our corpus using the same TimeLMs model used for sentiment prediction, but fine-tuned on the multi-label topic classification dataset of Antypas et al. (2022), again using all data splits. The dataset covers 19 topics: ‘arts & culture’, ‘business & entrepreneurs’, ‘celebrity & pop culture’, ‘diaries & daily life’, ‘family’, ‘fashion & style’, ‘film tv & video’, ‘fitness & health’, ‘food & dining’, ‘gaming’, ‘learning & educational’, ‘music’, ‘news & social concern’, ‘other hobbies’, ‘relationships’, ‘science & technology’, ‘sports’, ‘travel & adventure’, and ‘youth & student life’. As with sentiment, the time series resulting from topic predictions corresponds to the average of the mean scores for particular n-grams. For a better visualization experience, our online demo only shows the top four topics for a particular n-gram, determined from the sum of topic distributions over all periods.
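The top-topic selection used by the demo can be sketched as (a minimal helper; the input format and function name are our own):

```python
def top_topics(monthly_topic_means, k=4):
    """Pick the k topics with the largest summed mean score across all
    monthly intervals, as used to decide which topic lines to plot."""
    totals = {}
    for dist in monthly_topic_means:
        for topic, score in dist.items():
            totals[topic] = totals.get(topic, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]
```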


4 Tweet Insights Demo
Figure 1 shows an example of the demo visualization interface. Based on our collected vocabulary and data, as described in the previous section, our demo provides temporal line plot visualisations from January 2020 onward at monthly intervals. For frequency, similarity, and sentiment, the demo includes the option to visualize multiple words in the same plot. The Tweet Insights demo is available at https://tweetnlp.org/insights.
In this section, we also explore a few case studies using our proposed time series visualizations. We include visualizations for three words (parasite, folklore and macron) in Figure 2 and Figure 3, selected to highlight different insights made available by our resource. Below, we describe some interesting aspects of these cases in detail.
Case 1: parasite.
The frequency plot in Figure 2 shows a large spike in frequency for this word around 2020-02, when the movie of the same name won four Oscars. This peak was not sustained for long: three months later, frequency returned to values much closer to the ‘up to 2020’ prior. The connection between this popularity peak and the movie is corroborated by a similarly timed peak in the cosine distance plot, indicating a divergence of the word’s associations during this period from its more typical associations. Interestingly, while the frequency in 2020-02 is double that of 2020-01, the cosine distance plot shows that in 2020-01 its meaning associations had already clearly shifted. The sentiment plot shows the percentage of tweets classified as positive by the model. From it, we can see that positive sentiment increased during the awards, but then quickly returned to lower scores similar to its prior. Given the decrease in cosine distance closely following that drop in positive sentiment, we can attribute this drop to the word’s return to its typical associations (i.e., disease). Finally, the topic plot in Figure 3 further confirms that the movie awards are responsible for the frequency and positivity peak, showing a topic related to ‘film tv & video’ briefly overtaking the word’s typical ‘news & social concern’ topic. In combination, these results provide evidence of meaning shift for the word ‘parasite’ during its trending period.
Case 2: folklore.
Similarly to the previous case, this word also exhibits a relatively short period of increased popularity followed by a return to frequencies close to its prior. In this case, however, the cosine distance plot shows a sustained divergence in relation to its prior. Additionally, and most clearly, the corresponding topic plot in Figure 3 shows the ‘music’ topic becoming dominant after both the rise and fall of its popularity peak. These results suggest that even though the event behind the increased frequency of this word did not sustain its popularity, it did lead to a sustained meaning shift, potentially resulting in significantly different word meaning distributions in tweets containing this word sampled before and after the trending event.
Case 3: macron.
We include this word as a case study to show that Tweet Insights also allows for determining when a popularity peak is not associated with meaning shift. The frequency plot shows that this word peaked during the 2022 French presidential election. However, the other, more semantically relevant plots do not show significant change.
5 Conclusion
In this paper, we presented Tweet Insights, a tool to display temporal tendencies on Twitter at the word level, considering four distinct but complementary perspectives: frequency, embedding similarity, sentiment, and topic distribution. To compute all these signals on a large Twitter corpus, we use word embeddings and language models specialized for social media. Through our tool, users can analyse trends more deeply than with Google Trends or alternative frequency-based solutions, benefiting from recent advances in state-of-the-art NLP methods. While this work focuses on using our proposed time series to detect meaning shift associated with trending events, other applications should also benefit from this resource.
Limitations
Our platform is limited by several factors. First, the vocabulary was selected using statistical methods that rely on co-occurrence information, and therefore some words/entities may be missing. Second, we focus on a specific time period, and visualizations are only available from 2020 onward. Since the writing of this paper, the Twitter Academic API has been discontinued; we are currently looking at measures to extend the demo to future time periods. Third, the aggregated information may be affected by several factors, such as the reliance on specific word embedding methods and fine-tuned language models (in the case of sentiment and topic classification), as well as various types of ambiguity. Finally, the platform is currently only available for English. While we plan to extend it to a few more languages in the near future, this requires a non-trivial amount of extra development and testing.
Ethics Statement
This paper presents a visualization demo on user-generated data from Twitter. For this, we have followed all current Twitter regulations with respect to gathering and analysing Twitter data. Moreover, we have removed all URLs and non-verified Twitter mentions from tweets prior to the analysis.
References
- Alshaabi et al. (2021) Thayer Alshaabi, Jane L. Adams, Michael V. Arnold, Joshua R. Minot, David R. Dewhurst, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds. 2021. Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using twitter. Science Advances, 7(29):eabe6534.
- Antypas et al. (2022) Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Vitor Silva, Leonardo Neves, and Francesco Barbieri. 2022. Twitter topic classification. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3386–3400, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Baldwin et al. (2013) Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364.
- Barbieri et al. (2020) Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python, 1st edition. O’Reilly Media, Inc.
- Camacho-Collados et al. (2022) Jose Camacho-Collados, Kiamehr Rezaee, Talayeh Riahi, Asahi Ushio, Daniel Loureiro, Dimosthenis Antypas, Joanne Boisson, Luis Espinosa Anke, Fangyu Liu, and Eugenio Martínez Cámara. 2022. TweetNLP: Cutting-edge natural language processing for social media. In Proceedings of the The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–49, Abu Dhabi, UAE. Association for Computational Linguistics.
- Del Tredici et al. (2019) Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2019. Short-term meaning shift: A distributional exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2069–2075, Minneapolis, Minnesota. Association for Computational Linguistics.
- Di Carlo et al. (2019) Valerio Di Carlo, Federico Bianchi, and Matteo Palmonari. 2019. Training temporal word embeddings with a compass. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6326–6334.
- Dodds et al. (2015) Peter Sheridan Dodds, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. Human language reveals a universal positivity bias. Proceedings of the National Academy of Sciences, 112(8):2389–2394.
- Dredze et al. (2016) Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, and Miles Osborne. 2016. How twitter is changing the nature of financial news discovery. In proceedings of the second international workshop on data science for macro-modeling, pages 1–5.
- Kulkarni et al. (2015) Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, page 625–635, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
- Lim et al. (2015) Joon Soo Lim, YoungChan Hwang, Seyun Kim, and Frank A Biocca. 2015. How social media engagement leads to sports channel loyalty: Mediating roles of social presence and channel commitment. Computers in Human Behavior, 46:158–167.
- Loureiro et al. (2022a) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-collados. 2022a. TimeLMs: Diachronic language models from Twitter. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 251–260, Dublin, Ireland. Association for Computational Linguistics.
- Loureiro et al. (2022b) Daniel Loureiro, Aminette D’Souza, Areej Nasser Muhajab, Isabella A. White, Gabriel Wong, Luis Espinosa-Anke, Leonardo Neves, Francesco Barbieri, and Jose Camacho-Collados. 2022b. TempoWiC: An evaluation benchmark for detecting meaning shift in social media. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3353–3359, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
- Mitra et al. (2014) Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That’s sick dude!: Automatic identification of word sense change across different timescales. In Annual Meeting of the Association for Computational Linguistics.
- Morgan and Van Keulen (2014) Mena Badieh Habib Morgan and Maurice Van Keulen. 2014. Information extraction for social media. In Proceedings of the Third Workshop on Semantic Web and Information Extraction, pages 9–16.
- O’Connor et al. (2010) Brendan O’Connor, Michel Krieger, and David Ahn. 2010. Tweetmotif: Exploratory search and topic summarization for twitter. In Proceedings of the International AAAI Conference on Web and Social Media, volume 4, pages 384–385.
- Pfeffer et al. (2023) Juergen Pfeffer, Daniel Matter, Kokil Jaidka, Onur Varol, Afra Mashhadi, Jana Lasser, Dennis Assenmacher, Siqi Wu, Diyi Yang, Cornelia Brantner, Daniel M. Romero, Jahna Otterbacher, Carsten Schwemmer, Kenneth Joseph, David Garcia, and Fred Morstatter. 2023. Just another day on twitter: A complete 24 hours of twitter data.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
- Rosenthal et al. (2017) Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518, Vancouver, Canada. Association for Computational Linguistics.
- Wohn and Na (2011) D Yvette Wohn and Eun-Kyung Na. 2011. Tweeting about tv: Sharing television viewing experiences via social media message streams. First Monday.
- Zhuravskaya et al. (2020) Ekaterina Zhuravskaya, Maria Petrova, and Ruben Enikolopov. 2020. Political effects of the internet and social media. Annual Review of Economics, 12:415–438.
Appendix A Corpus Temporal Distribution
In this appendix, Figure 4 and Figure 5 show the temporal distribution of the tweets underlying our time series, displayed by month, day, hour, and minute, illustrating that the distributions are similar across time scales.

